Capability
13 artifacts provide this capability.
via “multi-language web-scale document collection with 40+ quality annotations”
30 trillion token web dataset with 40+ quality signals per document.
Unique: Processes 84 CommonCrawl dumps (claimed to be the most complete coverage compared with C4, RefinedWeb, Dolma, and SlimPajama) with 40+ pre-computed quality annotations per document, enabling fine-grained data curation research without requiring users to reprocess raw CommonCrawl. Open-source processing scripts allow reproducibility and custom filtering strategies on a standardized base dataset.
vs others: Larger scale (30 trillion tokens vs. C4's 156B tokens, RedPajama-1T's 1T tokens) with richer quality annotations (40+ signals vs. minimal metadata in competitors) and multilingual coverage, making it superior for comparative curation research and training diverse language models.
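As a rough illustration of curating on pre-computed annotations, the sketch below filters documents by per-document quality signals. It assumes each document is a dict carrying a `quality_signals` mapping; the signal names (`word_count`, `frac_unique_words`) and the thresholds are hypothetical and are not the dataset's actual schema.

```python
from typing import Iterable, Iterator

# Hypothetical (min, max) bounds keyed by hypothetical signal names.
RULES = {
    "word_count": (50, 100_000),
    "frac_unique_words": (0.2, 1.0),
}

def passes(signals: dict, rules: dict = RULES) -> bool:
    """Return True if every rule's signal is present and within its bounds."""
    for name, (low, high) in rules.items():
        value = signals.get(name)
        if value is None or not (low <= value <= high):
            return False
    return True

def curate(docs: Iterable[dict]) -> Iterator[dict]:
    """Yield only documents whose pre-computed signals satisfy all rules."""
    for doc in docs:
        if passes(doc.get("quality_signals", {})):
            yield doc
```

Because the signals are already attached to each document, changing a curation strategy in this style only means editing `RULES`, not reprocessing the raw crawl.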
via “quality-filtering-with-language-specific-heuristics”
6.3T token multilingual dataset across 167 languages.
Unique: Applies language-family-aware filtering rules (separate thresholds for Latin, CJK, Indic, and Arabic scripts) rather than universal heuristics, recognizing that character frequency distributions and valid repetition patterns differ dramatically across writing systems. Most datasets use a single global quality threshold regardless of language.
vs others: More linguistically-informed than mC4's basic filtering and more transparent than OSCAR's undocumented quality pipeline, reducing the risk of removing legitimate low-resource language content while still eliminating spam and corruption
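A minimal sketch of script-aware thresholds, assuming the only heuristic is a minimum document length in characters. The script detection via Unicode character names and every threshold value here are illustrative assumptions, not the dataset's published rules.

```python
import unicodedata
from collections import Counter

# Hypothetical per-script minimum lengths, in characters.
MIN_CHARS = {"LATIN": 200, "CJK": 80, "ARABIC": 150, "DEVANAGARI": 150}
DEFAULT_MIN_CHARS = 200

def dominant_script(text: str) -> str:
    """Guess the dominant script from the first word of each letter's Unicode name."""
    counts = Counter()
    for ch in text:
        if ch.isalpha():
            name = unicodedata.name(ch, "")
            if name:
                counts[name.split(" ")[0]] += 1
    return counts.most_common(1)[0][0] if counts else "UNKNOWN"

def keep(text: str) -> bool:
    """Apply the length threshold that matches the text's dominant script."""
    script = dominant_script(text)
    return len(text) >= MIN_CHARS.get(script, DEFAULT_MIN_CHARS)
```

The design point is that the same 100-character document can be acceptable in a dense script and too short in an alphabetic one, which a single global threshold cannot express.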
via “monthly crawl release coordination and versioning”
Largest open web crawl archive and the foundation of most open LLM training datasets.
Unique: Publishes monthly crawl snapshots with comprehensive statistics and errata tracking, enabling reproducible research and version-pinning. Each crawl is immutable and independently documented, supporting long-term archival and citation.
vs others: More transparent and reproducible than proprietary web data sources; monthly releases enable tracking of web evolution, whereas most competitors provide static or infrequently-updated snapshots.
via “multi-stage web data filtering pipeline”
Hugging Face's 15T token dataset, new standard for LLM training.
Unique: Combines learned quality classification (trained neural model) with statistical language detection and URL filtering in a staged pipeline, rather than rule-based heuristics alone. The quality classifier is trained on human-annotated examples, enabling nuanced detection of low-quality content beyond simple keyword/pattern matching.
vs others: Outperforms C4, Dolma, and RedPajama on downstream model benchmarks because it applies a learned quality classifier trained on curated examples rather than relying solely on heuristic rules or simpler statistical filters.
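The staged design described above can be sketched as a chain of predicates ordered from cheapest to most expensive. This is only an outline under stated assumptions: the blocklist, the `lang` and `quality_score` fields, and the 0.5 threshold are hypothetical placeholders, and a real pipeline would call a language identifier and a trained classifier where the stubs are.

```python
from typing import Callable, Iterable, Iterator, Sequence
from urllib.parse import urlparse

BLOCKED_HOSTS = {"spam.example"}  # hypothetical blocklist

def url_ok(doc: dict) -> bool:
    """Stage 1: drop documents from blocked hosts."""
    host = urlparse(doc["url"]).hostname or ""
    return host not in BLOCKED_HOSTS

def language_ok(doc: dict) -> bool:
    """Stage 2: placeholder for a language-identification step."""
    return doc.get("lang") == "en"

def quality_ok(doc: dict, threshold: float = 0.5) -> bool:
    """Stage 3: placeholder for a learned classifier score in [0, 1]."""
    return doc.get("quality_score", 0.0) >= threshold

STAGES: Sequence[Callable[[dict], bool]] = (url_ok, language_ok, quality_ok)

def run_pipeline(docs: Iterable[dict], stages=STAGES) -> Iterator[dict]:
    """Yield documents that pass every stage; all() stops at the first failure."""
    for doc in docs:
        if all(stage(doc) for stage in stages):
            yield doc
```

Putting the URL check first means the costlier stages only run on documents that survive the cheap ones.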
via “large-scale english text corpus filtering and deduplication”
Google's cleaned Common Crawl corpus used to train T5.
Unique: Uses deterministic heuristic-based filtering (length thresholds, keyword matching, language detection) applied at scale to 750GB of Common Crawl, enabling reproducible dataset creation without learned classifiers; includes sentence-level deduplication to remove redundant training examples
vs others: More transparent and reproducible than learned filtering approaches; larger and more thoroughly deduplicated than raw Common Crawl, but less sophisticated than newer datasets like FineWeb that use neural classifiers for quality scoring
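A minimal sketch of deterministic, heuristic filtering with exact deduplication. It works at line level as a stand-in for sentence-level deduplication, and the terminal-punctuation rule and `MIN_WORDS` value are illustrative assumptions rather than the corpus's published thresholds.

```python
import hashlib

TERMINAL = (".", "!", "?", '"')  # a line must end with one of these
MIN_WORDS = 5                    # illustrative minimum words per line

def clean_page(text: str, seen: set) -> str:
    """Keep lines that pass the heuristics and were not seen on any earlier page."""
    kept = []
    for line in text.splitlines():
        line = line.strip()
        if not line.endswith(TERMINAL) or len(line.split()) < MIN_WORDS:
            continue
        digest = hashlib.sha1(line.lower().encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # exact duplicate of a line already kept
        seen.add(digest)
        kept.append(line)
    return "\n".join(kept)
```

Sharing one `seen` set across pages removes repeated text corpus-wide, and since nothing is learned, the same input always produces the same output.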
Dataset by mlfoundations. 633,111 downloads.
Unique: Applies multi-stage quality filtering to Common Crawl 2023 PDFs using document completeness, text-image ratio, and language detection heuristics, reducing 1T+ tokens to 633K high-quality samples — unlike raw Common Crawl data requiring extensive downstream cleaning
vs others: Pre-filtered dataset eliminates need for manual quality assessment; curated subset is more suitable for training than raw Common Crawl; reduces data cleaning overhead compared to unfiltered web-scale datasets
via “large-scale web text corpus curation and filtering”
Dataset by HuggingFaceFW. 643,166 downloads.
Unique: Applies multi-stage filtering combining language detection, statistical quality metrics, and deduplication at Common Crawl scale (petabytes) to produce a single, reproducible 637B token English corpus — differs from ad-hoc web scraping by using standardized, publicly auditable filtering logic and preserving dataset versioning for research reproducibility
vs others: Larger and more carefully curated than raw Common Crawl dumps, yet more transparent and reproducible than proprietary datasets like those used in GPT-3/4, enabling open research on pretraining data quality
via “language-specific document filtering and quality ranking”
Dataset by allenai. 761,810 downloads.
Unique: C4's filtering is fully transparent and reproducible — the exact rules, thresholds, and blocklists are published and can be audited or modified. This contrasts with proprietary datasets where filtering logic is opaque. The approach uses language-specific metrics rather than one-size-fits-all rules, acknowledging that quality signals differ across scripts and languages.
vs others: C4's filtering is more transparent and auditable than proprietary datasets, while being simpler and more reproducible than learned quality models (which require labeled data and add complexity).
via “common crawl-sourced dataset with quality filtering and language detection”
Dataset by mlfoundations. 1,034,415 downloads.
Unique: Applies reproducible quality filtering to Common Crawl at scale, with transparent filtering criteria and public provenance — most proprietary datasets (Google, OpenAI) do not disclose filtering methods; most academic datasets are manually curated at smaller scale
vs others: Larger and more diverse than manually-curated datasets; more transparent and reproducible than proprietary web-scale datasets; enables research on real-world document distributions
via “common crawl 2023-14 snapshot filtering and deduplication”
Dataset by mlfoundations. 572,108 downloads.
Unique: Applies cross-crawl deduplication using content hashing to the Common Crawl 2023-14 snapshot, eliminating redundant PDFs that appear in multiple crawl cycles; most web-scale datasets (LAION, C4) deduplicate within a single crawl but not across temporal snapshots
vs others: Provides cleaner, deduplicated content than raw Common Crawl while maintaining web-scale diversity; more authentic than manually curated datasets (DocVQA, RVL-CDIP) but less curated than academic paper collections (arXiv, S2-ORC)
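Cross-snapshot deduplication by content hashing can be sketched as follows. The input shape (a dict mapping snapshot IDs to `(url, bytes)` pairs) and the keep-the-earliest policy are assumptions made for illustration; the dataset's actual pipeline is not described in that detail here.

```python
import hashlib
from typing import Dict, Iterable, List, Tuple

def content_hash(data: bytes) -> str:
    """Hash the raw document bytes; identical files give identical digests."""
    return hashlib.sha256(data).hexdigest()

def dedupe_across_snapshots(
    snapshots: Dict[str, Iterable[Tuple[str, bytes]]]
) -> List[Tuple[str, str]]:
    """Keep the first occurrence of each document across all snapshots.

    Snapshots are visited in sorted ID order, so the earliest crawl wins.
    Returns (snapshot_id, url) pairs for the retained documents.
    """
    seen = set()
    kept = []
    for snap_id in sorted(snapshots):
        for url, data in snapshots[snap_id]:
            digest = content_hash(data)
            if digest in seen:
                continue
            seen.add(digest)
            kept.append((snap_id, url))
    return kept
```

The key difference from per-crawl deduplication is that `seen` outlives a single snapshot, so a PDF recrawled months later is still recognized.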
via “common crawl pdf document sourcing and deduplication”
Dataset by mlfoundations. 796,577 downloads.
Unique: Leverages Common Crawl's pre-crawled WARC archives rather than performing independent web crawling, reducing infrastructure costs and ensuring reproducibility; applies URL canonicalization and optional content hashing for deduplication at scale
vs others: More cost-effective and reproducible than independent web crawling; larger and more diverse than manually curated document datasets, though with lower average quality due to lack of human filtering
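URL canonicalization for deduplication might look like the sketch below. The specific normalizations chosen (lowercasing scheme and host, dropping default ports and fragments, defaulting an empty path to `/`) are assumptions about what a reasonable canonical form is, not the dataset's documented rules.

```python
from urllib.parse import urlsplit, urlunsplit

DEFAULT_PORTS = {"http": 80, "https": 443}

def canonicalize(url: str) -> str:
    """Normalize a URL so trivially different spellings compare equal."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    port = parts.port
    # Keep the port only when it differs from the scheme's default.
    if port is None or port == DEFAULT_PORTS.get(scheme):
        netloc = host
    else:
        netloc = f"{host}:{port}"
    path = parts.path or "/"
    # The fragment is dropped because it does not change the fetched resource.
    return urlunsplit((scheme, netloc, path, parts.query, ""))
```

Canonical URLs catch duplicates cheaply before any bytes are read; content hashing can then be applied optionally to catch the same file served from different URLs.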
via “common crawl pdf snapshot integration and versioning”
Dataset by mlfoundations. 857,357 downloads.
Unique: Provides versioned, reproducible access to a specific Common Crawl PDF snapshot (2023-40) with full provenance tracking, enabling research reproducibility. Unlike generic Common Crawl access, it includes pre-processed extraction and structured metadata.
vs others: More reproducible than direct Common Crawl access (which changes over time) while providing pre-processed documents unlike raw Common Crawl snapshots.
via “common crawl snapshot integration and temporal consistency”
Dataset by mlfoundations. 539,406 downloads.
Unique: Anchors the entire dataset to a single Common Crawl snapshot (2023-06) with traceable WARC references, ensuring temporal consistency and reproducibility; most competing web-derived datasets either combine multiple crawl dates or lack explicit Common Crawl integration
vs others: More reproducible than datasets combining multiple crawl dates, and more verifiable than proprietary datasets without public provenance
© 2026 Unfragile. The platform for software for agents.