Common Crawl 2023 PDF Document Filtering And Quality Curation

1

RedPajama v2 · Dataset · 60/100

via “multi-language web-scale document collection with 40+ quality annotations”

30 trillion token web dataset with 40+ quality signals per document.

Unique: Processes 84 CommonCrawl dumps (claimed to be the most complete coverage compared with C4, RefinedWeb, Dolma, and SlimPajama) and attaches 40+ pre-computed quality annotations to each document, enabling fine-grained data curation research without requiring users to reprocess raw CommonCrawl. Open-source processing scripts support reproducibility and custom filtering strategies on a standardized base dataset.

vs others: Larger scale (30 trillion tokens vs. C4's 156B tokens and RedPajama-1T's 1T tokens), richer quality annotations (40+ signals vs. minimal metadata in competitors), and multilingual coverage, which make it well suited to comparative curation research and to training diverse language models.
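As a rough illustration of curating on pre-computed annotations, the sketch below keeps only documents whose signals satisfy a set of threshold rules. The signal names (`word_count`, `frac_duplicate_lines`, `perplexity`) and the thresholds are hypothetical placeholders chosen for the example, not the dataset's actual schema.

```python
from typing import Callable, Dict, Iterable, List

# Hypothetical record layout: {"id": ..., "signals": {name: value}}.
Doc = Dict[str, object]

# Each rule maps a signal name to a predicate its value must satisfy.
# Names and thresholds are illustrative assumptions only.
RULES: Dict[str, Callable[[float], bool]] = {
    "word_count": lambda v: 50 <= v <= 100_000,
    "frac_duplicate_lines": lambda v: v <= 0.3,
    "perplexity": lambda v: v <= 500.0,
}


def passes(doc: Doc, rules: Dict[str, Callable[[float], bool]] = RULES) -> bool:
    """Return True when every rule's signal is present and within bounds."""
    signals = doc["signals"]
    return all(name in signals and ok(signals[name]) for name, ok in rules.items())


def curate(docs: Iterable[Doc]) -> List[Doc]:
    """Keep only the documents that pass all rules."""
    return [d for d in docs if passes(d)]


docs = [
    {"id": "a", "signals": {"word_count": 320, "frac_duplicate_lines": 0.05, "perplexity": 210.0}},
    {"id": "b", "signals": {"word_count": 12, "frac_duplicate_lines": 0.00, "perplexity": 180.0}},
    {"id": "c", "signals": {"word_count": 900, "frac_duplicate_lines": 0.60, "perplexity": 150.0}},
]
kept_ids = [d["id"] for d in curate(docs)]
```

Because the signals are stored alongside each document, trying a different curation strategy only means editing the rule table, not recomputing anything from raw crawl data.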

2

CulturaX · Dataset · 59/100

via “quality-filtering-with-language-specific-heuristics”

6.3T token multilingual dataset across 167 languages.

Unique: Applies language-family-aware filtering rules (separate thresholds for Latin, CJK, Indic, and Arabic scripts) rather than universal heuristics, recognizing that character frequency distributions and valid repetition patterns differ dramatically across writing systems. Most datasets use a single global quality threshold regardless of language.

vs others: More linguistically-informed than mC4's basic filtering and more transparent than OSCAR's undocumented quality pipeline, reducing the risk of removing legitimate low-resource language content while still eliminating spam and corruption
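A minimal sketch of script-aware thresholds, assuming the dominant script can be approximated from Unicode character names. The threshold table and the two metrics (letter count, share of the most frequent letter) are hypothetical and only illustrate why one global cutoff can misfire across writing systems.

```python
import unicodedata
from collections import Counter

# Hypothetical per-script thresholds: (min_letters, max_share_of_top_letter).
THRESHOLDS = {
    "LATIN": (200, 0.20),
    "CJK": (60, 0.10),
    "ARABIC": (150, 0.25),
    "DEVANAGARI": (150, 0.25),
}
DEFAULT = (200, 0.20)


def dominant_script(text: str) -> str:
    """Guess the main script from the first word of each letter's Unicode name."""
    counts = Counter()
    for ch in text:
        if ch.isalpha():
            name = unicodedata.name(ch, "")
            if name:
                counts[name.split()[0]] += 1
    return counts.most_common(1)[0][0] if counts else "UNKNOWN"


def keep(text: str) -> bool:
    """Apply the length and repetition thresholds for the detected script."""
    min_letters, max_top_share = THRESHOLDS.get(dominant_script(text), DEFAULT)
    letters = [c for c in text if c.isalpha()]
    if len(letters) < min_letters:
        return False
    top_share = Counter(letters).most_common(1)[0][1] / len(letters)
    return top_share <= max_top_share


# 80 distinct CJK ideographs versus 80 Latin letters: same length, different verdicts.
cjk_text = "".join(chr(0x4E00 + i) for i in range(80))
latin_text = "abcdefghij" * 8
```

Under these assumed thresholds the 80-character CJK text is kept while the 80-character Latin text is rejected as too short, which is the kind of asymmetry a single global length cutoff cannot express.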

3

Common Crawl · Dataset · 59/100

via “monthly crawl release coordination and versioning”

Largest open web crawl archive and the source for most open LLM training datasets.

Unique: Publishes monthly crawl snapshots with comprehensive statistics and errata tracking, enabling reproducible research and version-pinning. Each crawl is immutable and independently documented, supporting long-term archival and citation.

vs others: More transparent and reproducible than proprietary web data sources; monthly releases enable tracking of web evolution, whereas most competitors provide static or infrequently-updated snapshots.

4

FineWeb · Dataset · 57/100

via “multi-stage web data filtering pipeline”

Hugging Face's 15T token dataset, new standard for LLM training.

Unique: Combines learned quality classification (trained neural model) with statistical language detection and URL filtering in a staged pipeline, rather than rule-based heuristics alone. The quality classifier is trained on human-annotated examples, enabling nuanced detection of low-quality content beyond simple keyword/pattern matching.

vs others: Outperforms C4, Dolma, and RedPajama on downstream model benchmarks because it applies a learned quality classifier trained on curated examples rather than relying solely on heuristic rules or simpler statistical filters.

5

C4 (Colossal Clean Crawled Corpus) · Dataset · 56/100

via “large-scale english text corpus filtering and deduplication”

Google's cleaned Common Crawl corpus used to train T5.

Unique: Uses deterministic heuristic filtering (length thresholds, keyword matching, language detection) applied to Common Crawl at scale, yielding roughly 750GB of cleaned text and enabling reproducible dataset creation without learned classifiers; includes span-level deduplication of repeated sentences to remove redundant training examples.

vs others: More transparent and reproducible than learned filtering approaches; cleaner and more thoroughly deduplicated than raw Common Crawl, but less sophisticated than newer datasets like FineWeb that use neural classifiers for quality scoring.
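The sketch below shows the general shape of deterministic heuristic filtering followed by exact deduplication. It is a simplified stand-in rather than the C4 pipeline itself: the thresholds are assumptions kept small for the demo, and exact line-level deduplication is used in place of the corpus's own deduplication step.

```python
import re

TERMINAL = (".", "!", "?", '"')
MIN_WORDS_PER_LINE = 3   # assumed threshold for illustration
MIN_LINES_PER_PAGE = 2   # assumed threshold, kept small for the demo


def clean_page(text, blocklist=frozenset()):
    """Return the cleaned page text, or None if the whole page is rejected."""
    lowered = text.lower()
    # Page-level rejections: placeholder text, code-like content, blocked words.
    if "lorem ipsum" in lowered or "{" in text:
        return None
    if any(word in blocklist for word in re.findall(r"\w+", lowered)):
        return None
    # Line-level filter: keep lines that look like complete sentences.
    kept = [
        line.strip()
        for line in text.splitlines()
        if line.strip().endswith(TERMINAL) and len(line.split()) >= MIN_WORDS_PER_LINE
    ]
    if len(kept) < MIN_LINES_PER_PAGE:
        return None
    return "\n".join(kept)


def dedup_lines(pages):
    """Drop any line already seen in an earlier page or earlier in the same page."""
    seen = set()
    result = []
    for page in pages:
        unique = []
        for line in page.splitlines():
            if line not in seen:
                seen.add(line)
                unique.append(line)
        result.append("\n".join(unique))
    return result


page = (
    "Home | About | Contact\n"
    "This is a complete sentence.\n"
    "Another full sentence follows here.\n"
    "click here"
)
cleaned = clean_page(page)
```

Every rule here is a fixed predicate, so running the same code on the same input always yields the same corpus, which is the reproducibility property the entry emphasizes.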

6

MINT-1T-PDF-CC-2023-23 · Dataset · 24/100

Dataset by mlfoundations. 633,111 downloads.

Unique: Applies multi-stage quality filtering to Common Crawl 2023 PDFs using document completeness, text-image ratio, and language detection heuristics, reducing 1T+ tokens to 633K high-quality samples — unlike raw Common Crawl data requiring extensive downstream cleaning

vs others: Pre-filtered dataset eliminates need for manual quality assessment; curated subset is more suitable for training than raw Common Crawl; reduces data cleaning overhead compared to unfiltered web-scale datasets
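A staged filter over the three heuristics named above (document completeness, text-image ratio, language detection) might look like the following sketch. The `PdfDoc` fields, the stage functions, and every threshold are hypothetical; the dataset's actual criteria are not given here.

```python
from dataclasses import dataclass


@dataclass
class PdfDoc:
    # All fields are hypothetical metadata assumed to come from an extraction step.
    doc_id: str
    pages_expected: int
    pages_extracted: int
    text_chars: int
    image_count: int
    lang: str
    lang_confidence: float


def is_complete(d: PdfDoc) -> bool:
    """Every declared page was successfully extracted."""
    return d.pages_expected > 0 and d.pages_extracted == d.pages_expected


def has_enough_text(d: PdfDoc, min_chars_per_image: int = 200) -> bool:
    """Require a minimum amount of text relative to the number of images."""
    return d.text_chars >= min_chars_per_image * max(d.image_count, 1)


def is_confident_language(d: PdfDoc, allowed=("en",), min_conf: float = 0.8) -> bool:
    """Accept only allowed languages detected with sufficient confidence."""
    return d.lang in allowed and d.lang_confidence >= min_conf


STAGES = [
    ("completeness", is_complete),
    ("text_image_ratio", has_enough_text),
    ("language", is_confident_language),
]


def run_pipeline(docs):
    """Return (kept ids, {rejected id: name of the first failing stage})."""
    kept, rejected = [], {}
    for d in docs:
        for name, check in STAGES:
            if not check(d):
                rejected[d.doc_id] = name
                break
        else:
            kept.append(d.doc_id)
    return kept, rejected


docs = [
    PdfDoc("ok", 10, 10, 5000, 4, "en", 0.95),
    PdfDoc("truncated", 10, 7, 5000, 0, "en", 0.99),
    PdfDoc("scan", 3, 3, 100, 3, "en", 0.90),
    PdfDoc("unsure", 2, 2, 3000, 0, "en", 0.40),
]
kept, rejected = run_pipeline(docs)
```

Recording the first failing stage for each rejected document makes it easy to see which heuristic removes the most data before tightening or loosening a threshold.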

7

fineweb · Dataset · 24/100

via “large-scale web text corpus curation and filtering”

Dataset by HuggingFaceFW. 643,166 downloads.

Unique: Applies multi-stage filtering combining language detection, statistical quality metrics, and deduplication at Common Crawl scale (petabytes) to produce a single, reproducible English corpus (15T tokens, per the FineWeb entry above). It differs from ad-hoc web scraping by using standardized, publicly auditable filtering logic and preserving dataset versioning for research reproducibility.

vs others: Larger and more carefully curated than raw Common Crawl dumps, yet more transparent and reproducible than proprietary datasets like those used in GPT-3/4, enabling open research on pretraining data quality

8

c4 · Dataset · 24/100

via “language-specific document filtering and quality ranking”

Dataset by allenai. 761,810 downloads.

Unique: C4's filtering is fully transparent and reproducible — the exact rules, thresholds, and blocklists are published and can be audited or modified. This contrasts with proprietary datasets where filtering logic is opaque. The approach uses language-specific metrics rather than one-size-fits-all rules, acknowledging that quality signals differ across scripts and languages.

vs others: C4's filtering is more transparent and auditable than proprietary datasets, while being simpler and more reproducible than learned quality models (which require labeled data and add complexity).

9

MINT-1T-PDF-CC-2024-18 · Dataset · 23/100

via “common crawl-sourced dataset with quality filtering and language detection”

Dataset by mlfoundations. 1,034,415 downloads.

Unique: Applies reproducible quality filtering to Common Crawl at scale, with transparent filtering criteria and public provenance — most proprietary datasets (Google, OpenAI) do not disclose filtering methods; most academic datasets are manually curated at smaller scale

vs others: Larger and more diverse than manually-curated datasets; more transparent and reproducible than proprietary web-scale datasets; enables research on real-world document distributions

10

MINT-1T-PDF-CC-2023-14 · Dataset · 23/100

via “common crawl 2023-14 snapshot filtering and deduplication”

Dataset by mlfoundations. 572,108 downloads.

Unique: Applies cross-crawl deduplication using content hashing to the Common Crawl 2023-14 snapshot, eliminating redundant PDFs that appear in multiple crawl cycles. Most web-scale datasets (LAION, C4) deduplicate within a single crawl but not across temporal snapshots.

vs others: Provides cleaner, deduplicated content than raw Common Crawl while maintaining web-scale diversity; more authentic than manually curated datasets (DocVQA, RVL-CDIP) but less curated than academic paper collections (arXiv, S2-ORC)
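Cross-crawl deduplication by content hash can be sketched as follows, assuming each crawl is available as a list of (URL, payload bytes) pairs and that the earliest crawl keeps the canonical copy. The function names and the choice of SHA-256 are illustrative assumptions, not the dataset's documented method.

```python
import hashlib


def content_hash(payload: bytes) -> str:
    """Hash the raw document bytes; identical files give identical digests."""
    return hashlib.sha256(payload).hexdigest()


def dedup_across_crawls(crawls):
    """Keep the first occurrence of each payload, scanning crawls in ID order.

    crawls: dict mapping a crawl ID to a list of (url, payload) pairs.
    Returns a dict mapping each crawl ID to the URLs retained from it.
    """
    seen = set()
    kept = {}
    for crawl_id in sorted(crawls):
        kept[crawl_id] = []
        for url, payload in crawls[crawl_id]:
            digest = content_hash(payload)
            if digest not in seen:
                seen.add(digest)
                kept[crawl_id].append(url)
    return kept


# Example data is made up; only the crawl ID format follows Common Crawl naming.
crawls = {
    "CC-MAIN-2023-14": [
        ("https://a.example/x.pdf", b"alpha"),
        ("https://b.example/y.pdf", b"beta"),
    ],
    "CC-MAIN-2023-06": [
        ("https://a.example/old.pdf", b"alpha"),
    ],
}
kept = dedup_across_crawls(crawls)
```

The shared `seen` set is what makes this cross-crawl: resetting it for every snapshot would reduce the procedure to ordinary within-crawl deduplication.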

11

MINT-1T-PDF-CC-2023-50 · Dataset · 23/100

via “common crawl pdf document sourcing and deduplication”

Dataset by mlfoundations. 796,577 downloads.

Unique: Leverages Common Crawl's pre-crawled WARC archives rather than performing independent web crawling, reducing infrastructure costs and ensuring reproducibility; applies URL canonicalization and optional content hashing for deduplication at scale

vs others: More cost-effective and reproducible than independent web crawling; larger and more diverse than manually curated document datasets, though with lower average quality due to lack of human filtering
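URL canonicalization for deduplication can be approximated with the standard library, as in the sketch below. The specific normalization rules (lowercasing the host, dropping default ports, fragments, and `utm_` parameters, sorting the query, trimming trailing slashes) are assumptions for illustration; the dataset's actual rules are not stated here.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

DEFAULT_PORTS = {"http": 80, "https": 443}


def canonicalize(url: str) -> str:
    """Map equivalent spellings of a URL to one canonical string."""
    parts = urlsplit(url.strip())
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()  # note: drops any user:password@ prefix
    port = parts.port
    # Keep the port only when it differs from the scheme's default.
    if port is None or DEFAULT_PORTS.get(scheme) == port:
        netloc = host
    else:
        netloc = f"{host}:{port}"
    # Drop tracking parameters and sort the rest for a stable order.
    query = urlencode(sorted(
        (k, v)
        for k, v in parse_qsl(parts.query, keep_blank_values=True)
        if not k.lower().startswith("utm_")
    ))
    path = parts.path.rstrip("/") or "/"
    # The fragment is discarded: it never reaches the server.
    return urlunsplit((scheme, netloc, path, query, ""))


canonical = canonicalize(
    "HTTP://Example.COM:80/docs/report.pdf?utm_source=x&b=2&a=1#page=3"
)
```

Canonical URLs catch re-fetches of the same address cheaply, while content hashing catches the same file served from different addresses, which is why the two are often combined.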

12

MINT-1T-PDF-CC-2023-40 · Dataset · 23/100

via “common crawl pdf snapshot integration and versioning”

Dataset by mlfoundations. 857,357 downloads.

Unique: Provides versioned, reproducible access to a specific Common Crawl PDF snapshot (2023-40) with full provenance tracking, enabling research reproducibility. Unlike generic Common Crawl access, it includes pre-processed extraction and structured metadata.

vs others: More reproducible than ad hoc Common Crawl access (where the set of available crawls grows over time), while providing pre-processed documents rather than raw snapshots.

13

MINT-1T-PDF-CC-2023-06 · Dataset · 23/100

via “common crawl snapshot integration and temporal consistency”

Dataset by mlfoundations. 539,406 downloads.

Unique: Anchors the entire dataset to a single Common Crawl snapshot (2023-06) with traceable WARC references, ensuring temporal consistency and reproducibility. Most competing web-derived datasets either combine multiple crawl dates or lack explicit Common Crawl integration.

vs others: More reproducible than datasets combining multiple crawl dates, and more verifiable than proprietary datasets without public provenance
