Capability
6 artifacts provide this capability.
Want a personalized recommendation?
Find the best match →via “multi-language web-scale document collection with 40+ quality annotations”
30 trillion token web dataset with 40+ quality signals per document.
Unique: Processes 84 CommonCrawl dumps (claimed as most complete coverage vs. C4, Refinedweb, Dolma, SlimPajama) with 40+ pre-computed quality annotations per document, enabling fine-grained data curation research without requiring users to reprocess raw CommonCrawl. Open-source processing scripts allow reproducibility and custom filtering strategies on a standardized base dataset.
vs others: Larger scale (30 trillion tokens vs. C4's 156B tokens, RedPajama-1T's 1T tokens) with richer quality annotations (40+ signals vs. minimal metadata in competitors) and multilingual coverage, making it superior for comparative curation research and training diverse language models.
via “large-scale english text corpus filtering and deduplication”
Google's cleaned Common Crawl corpus used to train T5.
Unique: Uses deterministic heuristic-based filtering (length thresholds, keyword matching, language detection) applied at scale to 750GB of Common Crawl, enabling reproducible dataset creation without learned classifiers; includes sentence-level deduplication to remove redundant training examples
vs others: More transparent and reproducible than learned filtering approaches; larger and more thoroughly deduplicated than raw Common Crawl, but less sophisticated than newer datasets like Fineweb that use neural classifiers for quality scoring
via “large-scale web text corpus curation and filtering”
Dataset by HuggingFaceFW. 6,43,166 downloads.
Unique: Applies multi-stage filtering combining language detection, statistical quality metrics, and deduplication at Common Crawl scale (petabytes) to produce a single, reproducible 637B token English corpus — differs from ad-hoc web scraping by using standardized, publicly auditable filtering logic and preserving dataset versioning for research reproducibility
vs others: Larger and more carefully curated than raw Common Crawl dumps, yet more transparent and reproducible than proprietary datasets like those used in GPT-3/4, enabling open research on pretraining data quality
via “large-scale-data-curation”
via “data-curation-and-filtering”
via “dataset quality assessment and curation”
Building an AI tool with “Large Scale Data Curation”?
Submit your artifact →curl unfragile.ai/agents.md | sh© 2026 Unfragile. The platform for software for agents.