Capability
3 artifacts provide this capability.
via “language-specific-corpus-filtering-and-subset-selection”
Multilingual web corpus covering 101 languages.
Unique: Provides language-partitioned Parquet files enabling efficient columnar filtering without full corpus download. Supports both batch download and streaming APIs, allowing researchers to work with language subsets at different scales (100MB to 300GB) without infrastructure overhead.
vs others: More flexible language selection than OSCAR (which requires manual filtering) and more scalable than downloading Wikipedia dumps per language, with built-in streaming for memory-constrained environments.
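The idea of language-partitioned files can be sketched with the standard library alone. This is a minimal illustration, not the dataset's actual loader: it assumes a hypothetical `lang=<code>/part-0.jsonl` directory layout and uses JSON Lines in place of Parquet so that it runs without extra dependencies. The function names `write_partitioned` and `stream_languages` are made up for this example.

```python
import json
import tempfile
from pathlib import Path


def write_partitioned(root, records):
    """Write records into hypothetical lang=<code>/part-0.jsonl partitions."""
    by_lang = {}
    for rec in records:
        by_lang.setdefault(rec["lang"], []).append(rec)
    for lang, recs in by_lang.items():
        part_dir = Path(root) / f"lang={lang}"
        part_dir.mkdir(parents=True, exist_ok=True)
        with open(part_dir / "part-0.jsonl", "w", encoding="utf-8") as fh:
            for rec in recs:
                fh.write(json.dumps(rec, ensure_ascii=False) + "\n")


def stream_languages(root, languages):
    """Yield records one at a time, opening only the requested partitions."""
    for lang in languages:
        part_dir = Path(root) / f"lang={lang}"
        for path in sorted(part_dir.glob("*.jsonl")):
            with open(path, encoding="utf-8") as fh:
                for line in fh:
                    yield json.loads(line)


# Illustrative sample data, not taken from any real corpus.
sample = [
    {"lang": "en", "text": "hello world"},
    {"lang": "sw", "text": "habari dunia"},
    {"lang": "en", "text": "second document"},
    {"lang": "fi", "text": "hei maailma"},
]

with tempfile.TemporaryDirectory() as tmp:
    write_partitioned(tmp, sample)
    # Only the "sw" and "fi" partitions are opened; "en" is never read.
    swahili_and_finnish = list(stream_languages(tmp, ["sw", "fi"]))
```

Because `stream_languages` is a generator, a caller can stop early or process records one by one without holding a whole partition in memory.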
via “language-specific subset filtering and selective loading”
BigScience's curated multilingual dataset for BLOOM.
Unique: ROOTS organizes data with language as the primary partitioning key, enabling zero-copy subset selection at the Datasets API level — users can load only relevant languages without materializing the full corpus, a design choice that reduces memory overhead compared to post-hoc filtering on monolithic datasets.
vs others: Compared to monolithic pretraining datasets like C4, ROOTS's language-partitioned structure allows selective loading without downloading irrelevant data, reducing iteration time and storage costs for multilingual or language-specific training.
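To make the storage argument concrete, here is a small sketch of selecting shards by language from a manifest. Everything in it is an assumption for illustration: the `MANIFEST` structure, the shard names, and the byte sizes are invented and do not describe how ROOTS is actually stored or loaded.

```python
# Hypothetical manifest: language code -> list of (shard name, size in bytes).
# The sizes are illustrative placeholders, not real corpus figures.
MANIFEST = {
    "en": [("en-0000", 400), ("en-0001", 350)],
    "fr": [("fr-0000", 200)],
    "yo": [("yo-0000", 5)],
}


def select_shards(manifest, languages):
    """Resolve the shards for the requested languages from the manifest alone."""
    missing = [lang for lang in languages if lang not in manifest]
    if missing:
        raise KeyError(f"languages not in manifest: {missing}")
    return [shard for lang in languages for shard in manifest[lang]]


def total_bytes(shards):
    """Sum the sizes of the given (name, size) shard entries."""
    return sum(size for _name, size in shards)


selected = select_shards(MANIFEST, ["fr", "yo"])
full = select_shards(MANIFEST, list(MANIFEST))

# Fraction of the corpus that never has to be downloaded for this selection.
saved_fraction = 1 - total_bytes(selected) / total_bytes(full)
```

With language as the partitioning key, the selection is resolved from metadata only, so shards for other languages are never touched.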
via “medical domain filtering and subset creation”
Dataset by lavita. 555,826 downloads.
Unique: Implements Arrow-level predicate pushdown for efficient filtering without materializing non-matching records. Supports both simple equality filters and complex Python predicates, with automatic optimization for common patterns.
vs others: More efficient than pandas filtering because Arrow evaluates predicates at the storage layer; more flexible than SQL WHERE clauses because it supports arbitrary Python logic.
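The difference between equality filters and arbitrary Python predicates can be illustrated with a pure-Python analogue of a columnar table. This is a sketch under stated assumptions, not Arrow's implementation: the table is a plain dict of column lists, and the column names (`specialty`, `question`, `answer_length`) and both helper functions are hypothetical.

```python
def filter_equal(table, column, value):
    """Equality filter: scan one column, then materialize only matching rows."""
    matches = [i for i, v in enumerate(table[column]) if v == value]
    return {name: [values[i] for i in matches] for name, values in table.items()}


def filter_predicate(table, predicate):
    """Arbitrary Python predicate: each row is built as a dict before testing."""
    names = list(table)
    n_rows = len(table[names[0]]) if names else 0
    matches = [
        i for i in range(n_rows)
        if predicate({name: table[name][i] for name in names})
    ]
    return {name: [table[name][i] for i in matches] for name in names}


# Illustrative columnar table; the values are invented for this example.
table = {
    "specialty": ["cardiology", "oncology", "cardiology", "neurology"],
    "question": ["q1", "q2", "q3", "q4"],
    "answer_length": [120, 45, 300, 80],
}

cardiology = filter_equal(table, "specialty", "cardiology")
short_answers = filter_predicate(table, lambda row: row["answer_length"] < 100)
```

The equality filter reads a single column to find matching positions, while the arbitrary predicate has to assemble every row first. That asymmetry is why simple equality filters are the easier case to optimize.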
© 2026 Unfragile. The platform for software for agents.