Common Crawl 2023-14 Snapshot Filtering and Deduplication

1

RedPajama v2 | Dataset | 61/100

via “document-level deduplication with hash-based matching”

30 trillion token web dataset with 40+ quality signals per document.

Unique: Uses document-level hash-based deduplication (preserving document boundaries) rather than token-level or fuzzy matching, enabling reproducible filtering and transparent deduplication hashes that users can inspect and verify. Processes 84 Common Crawl dumps with a consistent deduplication methodology.

vs others: Document-level deduplication is more interpretable and reproducible than token-level approaches, and the published deduplication hashes enable users to understand and verify which documents were removed, unlike proprietary datasets that hide deduplication decisions.
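As a rough illustration of document-level hash-based deduplication, the sketch below hashes each whole document and keeps the first occurrence of every hash. It is not RedPajama v2's actual pipeline: the normalization step (Unicode NFC plus whitespace collapsing), the choice of SHA-256, and the names `doc_hash` and `dedup_documents` are all assumptions made for the example.

```python
import hashlib
import unicodedata


def doc_hash(text: str) -> str:
    """Hash a whole document after light normalization.

    Assumption: NFC normalization plus whitespace collapsing. A real
    pipeline may normalize differently or not at all.
    """
    normalized = " ".join(unicodedata.normalize("NFC", text).split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def dedup_documents(docs):
    """Keep the first document seen for each hash.

    docs: iterable of (doc_id, text) pairs.
    Returns (kept, removed), where removed lists (doc_id, hash) pairs so the
    dropped documents can be inspected afterwards.
    """
    seen = set()
    kept, removed = [], []
    for doc_id, text in docs:
        h = doc_hash(text)
        if h in seen:
            removed.append((doc_id, h))
        else:
            seen.add(h)
            kept.append((doc_id, text))
    return kept, removed


docs = [
    ("a", "Hello   world"),
    ("b", "Hello world"),      # same content as "a" after normalization
    ("c", "Different page"),
]
kept, removed = dedup_documents(docs)
```

Because the removed hashes are returned alongside the kept documents, the list of dropped items can be published and checked independently, which is the transparency property the entry describes.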

2

FineWeb | Dataset | 58/100

via “temporal web crawl composition and versioning”

Hugging Face's 15T token dataset, new standard for LLM training.

Unique: Explicitly combines 96 historical Common Crawl snapshots with cross-snapshot deduplication, creating a temporally diverse dataset rather than using a single recent snapshot. This architectural choice prevents recency bias and captures web content evolution, unlike C4, which uses a single snapshot.

vs others: Provides temporal diversity across 12 years of web content with unified deduplication, whereas C4 uses a single Common Crawl snapshot and RedPajama uses multiple snapshots without explicit cross-snapshot deduplication, potentially introducing snapshot-specific duplicates.

3

mC4 | Dataset | 58/100

via “common crawl snapshot integration and versioning”

Multilingual web corpus covering 101 languages.

Unique: Provides explicit versioning tied to Common Crawl snapshots with full provenance metadata, enabling researchers to cite exact data sources and reproduce training runs. Integrates with Hugging Face Datasets versioning system for reproducible downloads across time.

vs others: More transparent data provenance than OSCAR (which obscures Common Crawl snapshot dates) and more reproducible than continuously updated web corpora, which change over time.

4

C4 (Colossal Clean Crawled Corpus) | Dataset | 57/100

via “large-scale english text corpus filtering and deduplication”

Google's cleaned Common Crawl corpus used to train T5.

Unique: Uses deterministic heuristic-based filtering (length thresholds, keyword matching, language detection) applied at scale to Common Crawl, yielding roughly 750 GB of cleaned text and enabling reproducible dataset creation without learned classifiers; also deduplicates repeated three-sentence spans to remove redundant training examples.

vs others: More transparent and reproducible than learned filtering approaches; larger and more thoroughly deduplicated than raw Common Crawl, but less sophisticated than newer datasets like FineWeb that use neural classifiers for quality scoring.
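The kind of deterministic heuristic filtering described for C4 can be sketched in a few lines. The thresholds, the blocklist contents, and the function name `clean_page` below are illustrative assumptions rather than the exact rules used to build C4, and language detection is left out because it would require an external library.

```python
import re

# Illustrative thresholds and blocklist: assumptions for this sketch only.
MIN_WORDS_PER_LINE = 5
MIN_SENTENCES = 3
BLOCKLIST = {"lorem ipsum"}
TERMINAL_PUNCTUATION = (".", "!", "?", '"')


def clean_page(text: str):
    """Return the cleaned page text, or None if the page is rejected."""
    lines = []
    for line in text.splitlines():
        line = line.strip()
        # Keep only lines that look like complete sentences.
        if not line.endswith(TERMINAL_PUNCTUATION):
            continue
        if len(line.split()) < MIN_WORDS_PER_LINE:
            continue
        lines.append(line)

    cleaned = "\n".join(lines)

    # Keyword matching: reject pages containing any blocklisted phrase.
    lowered = cleaned.lower()
    if any(phrase in lowered for phrase in BLOCKLIST):
        return None

    # Length threshold: require a minimum number of sentences per page.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", cleaned) if s]
    if len(sentences) < MIN_SENTENCES:
        return None
    return cleaned


page = (
    "Home | About | Contact\n"
    "This is the first full sentence here.\n"
    "Here is another complete sentence for you.\n"
    "And a third sentence closes the page."
)
result = clean_page(page)
```

Every rule is a fixed, inspectable condition, so running the same code on the same input always produces the same output. That is the reproducibility advantage over learned classifiers that the entry points to.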

5

MINT-1T-PDF-CC-2023-23 | Dataset | 25/100

via “common crawl 2023 pdf document filtering and quality curation”

Dataset by mlfoundations. 633,111 downloads.

Unique: Applies multi-stage quality filtering to Common Crawl 2023 PDFs using document completeness, text-image ratio, and language detection heuristics, yielding a curated subset, unlike raw Common Crawl data, which requires extensive downstream cleaning.

vs others: Pre-filtered dataset eliminates need for manual quality assessment; curated subset is more suitable for training than raw Common Crawl; reduces data cleaning overhead compared to unfiltered web-scale datasets

6

MINT-1T-PDF-CC-2023-14 | Dataset | 24/100

via “common crawl 2023-14 snapshot filtering and deduplication”

Dataset by mlfoundations. 572,108 downloads.

Unique: Applies cross-crawl deduplication using content hashing to Common Crawl 2023-14 snapshot, eliminating redundant PDFs that appear in multiple crawl cycles — most web-scale datasets (LAION, C4) deduplicate within a single crawl but not across temporal snapshots

vs others: Provides cleaner, deduplicated content than raw Common Crawl while maintaining web-scale diversity; more authentic than manually curated datasets (DocVQA, RVL-CDIP) but less curated than academic paper collections (arXiv, S2-ORC)
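Cross-crawl deduplication by content hashing can be sketched as follows. This is a minimal illustration and not the dataset's documented implementation: hashing raw bytes with SHA-256, keeping the copy from the earliest snapshot, and the names `content_hash` and `dedup_across_snapshots` are assumptions, and the URLs and payloads are made-up sample data.

```python
import hashlib


def content_hash(data: bytes) -> str:
    """Hash the raw document bytes (assumption: exact byte-level matching)."""
    return hashlib.sha256(data).hexdigest()


def dedup_across_snapshots(snapshots):
    """Keep each distinct document only in the earliest snapshot it appears in.

    snapshots: dict mapping a snapshot id such as "CC-MAIN-2023-14" to a list
    of (url, data) pairs. Snapshot ids are processed in sorted order, which
    matches chronological order for zero-padded year-week ids.
    Returns (unique, first_seen): the kept URLs per snapshot, and a map from
    content hash to the snapshot that first contained it.
    """
    first_seen = {}
    unique = {}
    for snap_id in sorted(snapshots):
        unique[snap_id] = []
        for url, data in snapshots[snap_id]:
            h = content_hash(data)
            if h not in first_seen:
                first_seen[h] = snap_id
                unique[snap_id].append(url)
    return unique, first_seen


snapshots = {
    "CC-MAIN-2023-14": [
        ("https://example.org/a.pdf", b"%PDF-report"),
        ("https://example.org/b.pdf", b"%PDF-manual"),
    ],
    "CC-MAIN-2023-06": [
        ("https://example.org/old/a.pdf", b"%PDF-report"),
    ],
}
unique, first_seen = dedup_across_snapshots(snapshots)
```

Here the report PDF is kept only in the earlier 2023-06 snapshot, even though it was crawled again under a different URL in 2023-14. Exact hashing misses near-duplicates such as re-exported PDFs with changed metadata; catching those would require a fuzzy method.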

7

MINT-1T-PDF-CC-2023-50 | Dataset | 24/100

via “common crawl pdf document sourcing and deduplication”

Dataset by mlfoundations. 796,577 downloads.

Unique: Leverages Common Crawl's pre-crawled WARC archives rather than performing independent web crawling, reducing infrastructure costs and ensuring reproducibility; applies URL canonicalization and optional content hashing for deduplication at scale

vs others: More cost-effective and reproducible than independent web crawling; larger and more diverse than manually curated document datasets, though with lower average quality due to lack of human filtering
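URL canonicalization for deduplication might look like the sketch below, which uses only Python's standard `urllib.parse`. The specific rules (lowercasing the host, dropping default ports and fragments, removing a small set of tracking parameters, sorting the rest) and the name `canonicalize_url` are assumptions chosen for illustration; the dataset's actual canonicalization rules are not stated in the entry.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumed set of tracking parameters to drop; real pipelines vary.
TRACKING_PARAMS = {"utm_source", "utm_medium", "utm_campaign"}
DEFAULT_PORTS = {"http": 80, "https": 443}


def canonicalize_url(url: str) -> str:
    """Map equivalent URLs to one canonical string (simplified sketch).

    Userinfo is dropped and IPv6 literals are not handled.
    """
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()

    # Drop the port when it is the default for the scheme.
    port = parts.port
    if port is None or port == DEFAULT_PORTS.get(scheme):
        netloc = host
    else:
        netloc = f"{host}:{port}"

    path = parts.path or "/"

    # Remove tracking parameters and sort the rest for a stable order.
    params = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in TRACKING_PARAMS
    ]
    query = urlencode(sorted(params))

    # The fragment is discarded because it never reaches the server.
    return urlunsplit((scheme, netloc, path, query, ""))


first = canonicalize_url(
    "HTTPS://Example.org:443/docs/a.pdf?utm_source=x&b=2&a=1#page=3"
)
second = canonicalize_url("https://example.org/docs/a.pdf?a=1&b=2")
```

Both calls produce the same canonical string, so the two URLs collapse to one key. Canonical URLs catch the same resource reached through different links, while the optional content hash catches identical files hosted at unrelated URLs.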

8

MINT-1T-PDF-CC-2023-40 | Dataset | 24/100

via “common crawl pdf snapshot integration and versioning”

Dataset by mlfoundations. 857,357 downloads.

Unique: Provides versioned, reproducible access to specific Common Crawl PDF snapshot (2023-40) with full provenance tracking, enabling research reproducibility. Unlike generic Common Crawl access, includes pre-processed extraction and structured metadata.

vs others: More reproducible than direct Common Crawl access (which changes over time) while providing pre-processed documents unlike raw Common Crawl snapshots.

9

MINT-1T-PDF-CC-2023-06 | Dataset | 24/100

via “common crawl snapshot integration and temporal consistency”

Dataset by mlfoundations. 539,406 downloads.

Unique: Anchors entire dataset to a single Common Crawl snapshot (2023-06) with traceable WARC references, ensuring temporal consistency and reproducibility — most competing web-derived datasets either combine multiple crawl dates or lack explicit Common Crawl integration

vs others: More reproducible than datasets combining multiple crawl dates, and more verifiable than proprietary datasets without public provenance

10

MINT-1T-PDF-CC-2024-18 | Dataset | 24/100

via “common crawl-sourced dataset with quality filtering and language detection”

Dataset by mlfoundations. 1,034,415 downloads.

Unique: Applies reproducible quality filtering to Common Crawl at scale, with transparent filtering criteria and public provenance — most proprietary datasets (Google, OpenAI) do not disclose filtering methods; most academic datasets are manually curated at smaller scale

vs others: Larger and more diverse than manually-curated datasets; more transparent and reproducible than proprietary web-scale datasets; enables research on real-world document distributions
