Capability
Streaming Dataset Access With Lazy Loading And Memory Efficiency
20 artifacts provide this capability.
Top Matches
via “large-scale distributed dataset processing and streaming”
A 783 GB curated code dataset spanning 86 programming languages, with PII redaction.
Unique: A distributed processing pipeline integrated with Hugging Face Datasets provides streaming access, so the 783 GB corpus can be handled without loading it fully into memory. Most competing datasets require downloading the entire corpus.
vs others: More scalable than CodeSearchNet (which requires a full download) and more flexible than GitHub-Code (which has no streaming API), enabling efficient training on resource-constrained hardware.
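As a rough illustration of what streaming access looks like in practice, the sketch below opens a dataset with the Hugging Face Datasets `load_dataset(..., streaming=True)` call and consumes it in small batches. The listing does not name the dataset repository, so `dataset_id` is a placeholder, and the helper names `open_stream` and `batched` are hypothetical. The batching helper is plain Python and only assumes that records arrive as an iterable of dictionaries.

```python
from itertools import islice
from typing import Any, Dict, Iterable, Iterator, List


def open_stream(dataset_id: str, split: str = "train") -> Iterable[Dict[str, Any]]:
    """Open a Hugging Face dataset in streaming mode (requires network access).

    `dataset_id` is a placeholder: the listing does not name the repository.
    """
    # Imported here so the batching helper below works without the library.
    from datasets import load_dataset

    # streaming=True returns an iterable dataset that fetches records on
    # demand instead of downloading the whole corpus first.
    return load_dataset(dataset_id, split=split, streaming=True)


def batched(
    records: Iterable[Dict[str, Any]], batch_size: int
) -> Iterator[List[Dict[str, Any]]]:
    """Yield fixed-size batches while holding at most one batch in memory."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(records)
    while True:
        # islice pulls only batch_size records from the underlying stream.
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


if __name__ == "__main__":
    # Synthetic stand-in for a streamed dataset, so the demo runs offline.
    synthetic = ({"content": f"file {i}"} for i in range(10))
    for batch in batched(synthetic, batch_size=4):
        print(len(batch), batch[0]["content"])
```

With a real repository identifier, `batched(open_stream(dataset_id), 1000)` would follow the same pattern. Peak memory then depends on the batch size rather than on the size of the corpus.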