Large Scale Web Text Corpus Curation And Filtering

1

RedPajama v2 · Dataset · 61/100

via “multi-language web-scale document collection with 40+ quality annotations”

30 trillion token web dataset with 40+ quality signals per document.

Unique: Processes 84 CommonCrawl dumps (claimed as the most complete coverage compared with C4, RefinedWeb, Dolma, and SlimPajama) and ships 40+ pre-computed quality annotations per document, so researchers can study fine-grained data curation without reprocessing raw CommonCrawl. Open-source processing scripts support reproducibility and custom filtering strategies on a standardized base dataset.

vs others: Larger scale (30 trillion tokens vs. C4's 156B tokens and RedPajama-1T's 1T tokens) with richer quality annotations (40+ signals vs. minimal metadata in competitors) and multilingual coverage, which makes it well suited to comparative curation research and to training diverse language models.
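
As a rough illustration of how pre-computed quality signals can drive curation without reprocessing the crawl, the sketch below filters documents by thresholding per-document annotations. The signal names and thresholds are hypothetical placeholders, not the dataset's actual schema.

```python
from typing import Callable, Dict

# Hypothetical signal names and thresholds, chosen only for illustration;
# they are not the dataset's real annotation keys.
RULES: Dict[str, Callable[[float], bool]] = {
    "word_count": lambda v: v >= 50,
    "frac_lines_ending_ellipsis": lambda v: v <= 0.3,
    "perplexity": lambda v: v <= 500.0,
}


def passes(signals: Dict[str, float]) -> bool:
    """A document passes only if every rule accepts its signal value."""
    return all(rule(signals[name]) for name, rule in RULES.items())


docs = [
    {"id": "a", "signals": {"word_count": 420, "frac_lines_ending_ellipsis": 0.0, "perplexity": 180.0}},
    {"id": "b", "signals": {"word_count": 12, "frac_lines_ending_ellipsis": 0.0, "perplexity": 150.0}},
    {"id": "c", "signals": {"word_count": 900, "frac_lines_ending_ellipsis": 0.6, "perplexity": 210.0}},
]

# Only the annotations are inspected; the document text is never re-parsed.
kept = [d["id"] for d in docs if passes(d["signals"])]
print(kept)
```

Because the rules only read annotations, changing a threshold and re-running is cheap, which is the kind of experiment the pre-computed signals are meant to enable.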

2

The Pile · Dataset · 60/100

via “web-scale text corpus with deduplication and quality filtering”

EleutherAI's 825 GiB diverse training dataset from 22 sources.

Unique: Combines Reddit-curated web text (OpenWebText2) with filtered Common Crawl (Pile-CC) rather than relying on raw Common Crawl alone, applying implicit quality filtering through Reddit curation and explicit deduplication/filtering on Pile-CC. This hybrid approach balances web-scale coverage with quality, addressing a key limitation of earlier web-only datasets.

vs others: Positioned as higher quality than corpora built from Common Crawl alone (e.g., C4), thanks to Reddit curation and additional filtering; broader coverage than Reddit-only datasets; comparable to Falcon RefinedWeb in approach, but with a less thoroughly documented filtering methodology.
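
To make the deduplication step more concrete, here is a minimal sketch of near-duplicate removal using word-shingle Jaccard similarity. It is a simplified stand-in rather than the dataset's actual pipeline; the shingle size and the 0.8 threshold are assumptions, and the pairwise comparison is quadratic, so it only suits small samples.

```python
import re
from typing import List, Set, Tuple


def shingles(text: str, n: int = 3) -> Set[Tuple[str, ...]]:
    """Set of overlapping n-word tuples from lowercased, punctuation-free text."""
    words = re.findall(r"\w+", text.lower())
    return {tuple(words[i:i + n]) for i in range(max(len(words) - n + 1, 1))}


def jaccard(a: Set[Tuple[str, ...]], b: Set[Tuple[str, ...]]) -> float:
    """Intersection over union of two shingle sets."""
    return len(a & b) / len(a | b)


def dedup(texts: List[str], threshold: float = 0.8) -> List[str]:
    """Keep a document only if it is not too similar to any already-kept one."""
    kept: List[str] = []
    kept_shingles: List[Set[Tuple[str, ...]]] = []
    for text in texts:
        current = shingles(text)
        if all(jaccard(current, other) < threshold for other in kept_shingles):
            kept.append(text)
            kept_shingles.append(current)
    return kept


corpus = [
    "The quick brown fox jumps over the lazy dog near the river bank.",
    "The quick brown fox jumps over the lazy dog near the river bank!",
    "An entirely different sentence about data curation pipelines.",
]
unique_docs = dedup(corpus)
print(len(unique_docs))
```

At corpus scale the same idea is usually approximated with hashing schemes such as MinHash, which avoid comparing every pair of documents.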

3

CulturaX · Dataset · 60/100

via “quality-filtering-with-language-specific-heuristics”

6.3T token multilingual dataset across 167 languages.

Unique: Applies language-family-aware filtering rules (separate thresholds for Latin, CJK, Indic, and Arabic scripts) rather than universal heuristics, recognizing that character frequency distributions and valid repetition patterns differ dramatically across writing systems. Most datasets use a single global quality threshold regardless of language.

vs others: More linguistically-informed than mC4's basic filtering and more transparent than OSCAR's undocumented quality pipeline, reducing the risk of removing legitimate low-resource language content while still eliminating spam and corruption
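
The script-aware thresholds described above can be sketched as follows. This is an illustrative approximation only: the script-to-family mapping, the character-trigram repetition metric, and every threshold value are assumptions made for the example, not CulturaX's published settings.

```python
import unicodedata
from collections import Counter

# Hypothetical mapping from Unicode script names to broad families.
FAMILY_BY_SCRIPT = {
    "LATIN": "latin",
    "CJK": "cjk", "HIRAGANA": "cjk", "KATAKANA": "cjk", "HANGUL": "cjk",
    "DEVANAGARI": "indic", "BENGALI": "indic", "TAMIL": "indic",
    "ARABIC": "arabic",
}

# Hypothetical per-family ceilings on the repetition ratio.
MAX_REPETITION = {"latin": 0.5, "cjk": 0.7, "indic": 0.6, "arabic": 0.6, "other": 0.5}


def dominant_family(text: str) -> str:
    """Most frequent script family among the alphabetic characters."""
    counts: Counter = Counter()
    for ch in text:
        if ch.isalpha():
            script = unicodedata.name(ch, "UNKNOWN").split()[0]
            counts[FAMILY_BY_SCRIPT.get(script, "other")] += 1
    return counts.most_common(1)[0][0] if counts else "other"


def repetition_ratio(text: str, n: int = 3) -> float:
    """Share of character n-grams that repeat an earlier n-gram."""
    grams = [text[i:i + n] for i in range(len(text) - n + 1)]
    return 1 - len(set(grams)) / len(grams) if grams else 0.0


def keep(text: str) -> bool:
    """Apply the threshold that belongs to the text's own script family."""
    return repetition_ratio(text) <= MAX_REPETITION[dominant_family(text)]


print(dominant_family("Hello world"), keep("spam spam spam spam spam spam spam spam"))
```

The point of the design is the lookup in `keep`: the same metric is judged against a different ceiling depending on the writing system, instead of one global cutoff.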

4

Dolma · Dataset · 59/100

via “web text filtering and deduplication across common crawl and c4 sources”

Allen AI's 3T token dataset for fully reproducible LLM training.

Unique: Draws on two complementary web sources, Common Crawl and C4, with source-specific filtering: Common Crawl supplies raw coverage and C4 supplies pre-filtered quality. Most datasets use either raw crawls or pre-filtered sources, not both. The filtering rules are documented (though not detailed in the available materials), which supports a degree of reproducibility that most web datasets lack.

vs others: Dolma's dual-source web data offers broader coverage and more transparent, reproducible processing than C4 alone, though it is smaller and less frequently updated than continuously refreshed web crawl datasets.

5

OPUS · Dataset · 59/100

via “web-crawled general-domain parallel corpus aggregation”

Massive parallel corpus for machine translation.

Unique: Aggregates CCMatrix (17.1B pairs, 16.61% of the collection), ParaCrawl (4.6B pairs, 4.50%), and WikiMatrix (933.6M pairs), together providing 22.6B+ web-crawled and Wikipedia-based parallel sentences. CCMatrix alone is the third-largest corpus in OPUS, making web-crawled data a dominant component of the aggregation alongside subtitles and institutional sources.

vs others: Provides centralized access to multiple large-scale web-crawled corpora in a single interface, whereas accessing these sources individually requires visiting separate repositories; however, lacks quality filtering, deduplication across sources, and documentation of alignment confidence that specialized MT data providers offer.
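
The figures quoted above can be sanity-checked with a few lines of arithmetic. The sketch below uses only the numbers stated in this entry; the overall collection size it derives is an inference from the quoted percentages, not a published figure.

```python
# Pair counts and collection shares as quoted in this entry.
pairs = {"CCMatrix": 17.1e9, "ParaCrawl": 4.6e9, "WikiMatrix": 933.6e6}
share = {"CCMatrix": 0.1661, "ParaCrawl": 0.0450}

# Sum of the three web-crawled and Wikipedia-based corpora.
total_pairs = sum(pairs.values())

# Collection size implied by each (pairs, share) combination.
implied_collection = {name: pairs[name] / share[name] for name in share}

print(f"combined pairs: {total_pairs / 1e9:.2f}B")
for name, size in implied_collection.items():
    print(f"collection size implied by {name}: {size / 1e9:.1f}B")
```

The three corpora sum to roughly 22.63B pairs, matching the "22.6B+" claim, and both percentages imply a total collection of a little over 100B sentence pairs, so the quoted shares are mutually consistent.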

6

mC4 · Dataset · 58/100

via “multilingual-text-corpus-extraction-from-web-crawl”

Multilingual web corpus covering 101 languages.

Unique: Processes Common Crawl at petabyte scale with language-aware segmentation across 101 languages, providing pre-filtered language-specific subsets rather than requiring downstream filtering. Uses probabilistic language ID to avoid expensive manual annotation while maintaining reasonable precision for high-resource languages.

vs others: Larger and more multilingual than OSCAR (85 languages) and more web-representative than Wikipedia-derived corpora, but with weaker quality control than small, human-curated datasets such as those in the GLUE and SuperGLUE benchmarks.

7

FineWeb · Dataset · 58/100

via “multi-stage web data filtering pipeline”

Hugging Face's 15T token dataset, new standard for LLM training.

Unique: Combines learned quality classification (trained neural model) with statistical language detection and URL filtering in a staged pipeline, rather than rule-based heuristics alone. The quality classifier is trained on human-annotated examples, enabling nuanced detection of low-quality content beyond simple keyword/pattern matching.

vs others: Reported to outperform C4, Dolma, and RedPajama on downstream model benchmarks; the gain is attributed to a learned quality classifier trained on curated examples, as opposed to relying solely on heuristic rules or simpler statistical filters.
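
A staged pipeline of the kind described above can be sketched as an ordered list of predicates, where each document is dropped at the first stage it fails. Everything here is a stand-in: the blocked domain, the ASCII-letter language check, and the type-token-ratio score are hypothetical substitutes for a real blocklist, language identifier, and trained classifier.

```python
from typing import Callable, Dict, List, Tuple
from urllib.parse import urlparse

Doc = Dict[str, str]

BLOCKED_DOMAINS = {"spam.example"}  # hypothetical blocklist


def url_ok(doc: Doc) -> bool:
    return urlparse(doc["url"]).hostname not in BLOCKED_DOMAINS


def looks_english(doc: Doc) -> bool:
    # Crude stand-in for a language identifier: share of ASCII letters.
    letters = [c for c in doc["text"] if c.isalpha()]
    return bool(letters) and sum(c.isascii() for c in letters) / len(letters) >= 0.9


def quality_ok(doc: Doc) -> bool:
    # Stand-in for a learned classifier: length plus type-token ratio.
    words = doc["text"].split()
    return len(words) >= 5 and len(set(words)) / len(words) >= 0.5


STAGES: List[Tuple[str, Callable[[Doc], bool]]] = [
    ("url", url_ok), ("language", looks_english), ("quality", quality_ok),
]


def run_pipeline(docs: List[Doc]) -> Tuple[List[Doc], Dict[str, int]]:
    """Return surviving documents and a per-stage count of drops."""
    dropped = {name: 0 for name, _ in STAGES}
    kept = []
    for doc in docs:
        for name, check in STAGES:
            if not check(doc):
                dropped[name] += 1
                break
        else:
            kept.append(doc)
    return kept, dropped


docs = [
    {"url": "https://spam.example/a", "text": "Buy now"},
    {"url": "https://news.example/b", "text": "Это русский текст о данных."},
    {"url": "https://blog.example/c", "text": "click click click click click click"},
    {"url": "https://docs.example/d", "text": "Careful filtering keeps useful pages and removes obvious noise."},
]
kept, dropped = run_pipeline(docs)
print([d["url"] for d in kept], dropped)
```

Ordering the cheap checks first means the most expensive stage, which in a real pipeline would be the classifier, only runs on documents that survived everything else.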

8

C4 (Colossal Clean Crawled Corpus) · Dataset · 57/100

via “large-scale english text corpus filtering and deduplication”

Google's cleaned Common Crawl corpus used to train T5.

Unique: Uses deterministic heuristic filtering (length thresholds, keyword matching, language detection) applied at scale to Common Crawl, yielding roughly 750 GB of cleaned English text and enabling reproducible dataset creation without learned classifiers; includes deduplication of repeated three-sentence spans to remove redundant training examples.

vs others: More transparent and reproducible than learned filtering approaches; cleaner and more thoroughly deduplicated than raw Common Crawl, but less sophisticated than newer datasets like FineWeb that use neural classifiers for quality scoring.
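
The deterministic style of filtering described above can be sketched in a few functions. The rules are loosely modeled on C4-style heuristics (keep lines that end in terminal punctuation, drop placeholder pages, require a minimum number of sentences), followed by deduplication over repeated three-sentence spans as one way to realize sentence-level deduplication. The exact thresholds and the naive sentence splitter are assumptions for illustration.

```python
import re
from typing import List, Optional

TERMINAL = (".", "!", "?", '"')


def split_sentences(text: str) -> List[str]:
    """Naive splitter: break after ., ! or ? followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def clean_page(text: str, min_words: int = 3, min_sentences: int = 3) -> Optional[str]:
    """Return the cleaned page, or None if the whole page is rejected."""
    lines = [ln.strip() for ln in text.splitlines()]
    kept = [ln for ln in lines if ln.endswith(TERMINAL) and len(ln.split()) >= min_words]
    body = "\n".join(kept)
    if "lorem ipsum" in body.lower() or "{" in body:
        return None
    if len(split_sentences(body)) < min_sentences:
        return None
    return body


def dedup_spans(pages: List[List[str]], span: int = 3) -> List[List[str]]:
    """Drop any run of `span` sentences that already appeared earlier."""
    seen = set()
    result = []
    for sentences in pages:
        out, i = [], 0
        while i < len(sentences):
            window = tuple(sentences[i:i + span])
            if len(window) == span and window in seen:
                i += span  # skip the repeated span entirely
                continue
            if len(window) == span:
                seen.add(window)
            out.append(sentences[i])
            i += 1
        result.append(out)
    return result


page = "Home\nA first sentence here. A second sentence here. A third sentence here.\nMenu"
print(clean_page(page))
print(dedup_spans([["A.", "B.", "C.", "D."], ["X.", "A.", "B.", "C.", "Y."]]))
```

Because every rule is a fixed predicate, running the same code over the same crawl snapshot gives the same corpus, which is the reproducibility argument made for heuristic filtering.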

9

fineweb · Dataset · 25/100

via “large-scale web text corpus curation and filtering”

Dataset by HuggingFaceFW. 643,166 downloads.

Unique: Applies multi-stage filtering combining language detection, statistical quality metrics, and deduplication at Common Crawl scale (petabytes) to produce a single, reproducible 637B token English corpus — differs from ad-hoc web scraping by using standardized, publicly auditable filtering logic and preserving dataset versioning for research reproducibility

vs others: Larger and more carefully curated than raw Common Crawl dumps, yet more transparent and reproducible than proprietary datasets like those used in GPT-3/4, enabling open research on pretraining data quality

10

c4 · Dataset · 25/100

via “multilingual web-scale text corpus ingestion and deduplication”

Dataset by allenai. 761,810 downloads.

Unique: C4 is built directly from Common Crawl snapshots with transparent, reproducible filtering and deduplication logic (published in the original paper), making it auditable and replicable — unlike proprietary datasets. It includes explicit language detection and URL-based quality filtering applied uniformly across 100+ languages, enabling fair multilingual representation.

vs others: C4 offers 10x larger scale and true multilingual coverage compared to English-only datasets like Wikipedia or BookCorpus, while maintaining open-source transparency and reproducibility that proprietary datasets (e.g., GPT-3's training data) cannot provide.

11

FineFineWeb · Dataset · 24/100

via “large-scale web text corpus loading and streaming”

Dataset by m-a-p. 459,057 downloads.

Unique: Combines HuggingFace's distributed Parquet infrastructure with lazy-loading semantics, enabling researchers to train on multi-billion-token corpora without pre-downloading; uses columnar storage for efficient selective field access (e.g., text-only vs. text+metadata queries)

vs others: Faster iteration than raw Common Crawl dumps (no preprocessing overhead) and more accessible than proprietary web corpora (free, open-source, Apache 2.0 licensed); streaming can suit teams with bandwidth but limited storage better than workflows that require a full local download.
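
The lazy-loading and selective-field ideas above can be illustrated without any external service. The sketch below is a standard-library stand-in that streams JSON Lines from an in-memory buffer and keeps only the requested fields; in practice, streaming from the Hub would go through the Hugging Face `datasets` library (for example `load_dataset(..., streaming=True)`). The record layout is hypothetical.

```python
import io
import json
from itertools import islice
from typing import Dict, Iterator, List, TextIO


def stream_records(handle: TextIO, fields: List[str]) -> Iterator[Dict[str, str]]:
    """Yield one record at a time, keeping only the requested fields."""
    for line in handle:
        if line.strip():
            record = json.loads(line)
            yield {key: record[key] for key in fields}


# Hypothetical corpus of 1000 records with text plus metadata.
raw = "\n".join(
    json.dumps({"text": f"doc {i}", "url": f"https://example.org/{i}", "domain": "demo"})
    for i in range(1000)
)
handle = io.StringIO(raw)

# Only two lines are parsed; the rest of the buffer is never touched.
first_two = list(islice(stream_records(handle, ["text"]), 2))
remaining = sum(1 for _ in handle)

print(first_two, remaining)
```

The generator does no work until a consumer asks for the next record, so a training loop can start on the first batch while the bulk of the corpus is still unread.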

12

finephrase · Dataset · 24/100

via “filtered-educational-web-corpus-access”

Dataset by HuggingFaceFW. 474,259 downloads.

Unique: Builds on FineWeb-Edu's multi-stage filtering pipeline (deduplication, language detection, educational heuristics) rather than raw Common Crawl, with a claimed ~10x higher signal-to-noise ratio. Provides transparent versioning and reproducibility through Hugging Face's dataset infrastructure, enabling audit trails for model training.

vs others: Higher quality and more curated than generic web corpora (Common Crawl, C4), but smaller and more specialized than general-purpose datasets like The Pile or LAION.

13

fineweb-edu · Dataset · 24/100

via “large-scale educational text dataset curation and filtering”

Dataset by HuggingFaceFW. 414,812 downloads.

Unique: Applies educational domain classification and quality filtering on top of FineWeb's base curation, using heuristics tuned specifically for pedagogical content (e.g., educational institution detection, curriculum keywords, readability metrics) rather than generic web quality signals. Integrated with Hugging Face Hub for streaming access without full download.

vs others: More targeted for education use cases than raw Common Crawl or generic FineWeb, with pre-applied educational filtering that reduces downstream cleaning work compared to manually curating web sources or using unfiltered crawl data.

14

MINT-1T-PDF-CC-2024-18 · Dataset · 24/100

via “common crawl-sourced dataset with quality filtering and language detection”

Dataset by mlfoundations. 1,034,415 downloads.

Unique: Applies reproducible quality filtering to Common Crawl at scale, with transparent filtering criteria and public provenance — most proprietary datasets (Google, OpenAI) do not disclose filtering methods; most academic datasets are manually curated at smaller scale

vs others: Larger and more diverse than manually-curated datasets; more transparent and reproducible than proprietary web-scale datasets; enables research on real-world document distributions

15

MINT-1T-PDF-CC-2023-40 · Dataset · 24/100

via “large-scale text corpus for language model pretraining”

Dataset by mlfoundations. 857,357 downloads.

Unique: Derives 1 trillion tokens specifically from PDF documents rather than generic web crawls, capturing formal, structured writing with higher information density than typical web text. Preserves document-level context and structure signals that web-only corpora lose.

vs others: Complements web-text corpora (C4, The Pile) by providing document-sourced content with different statistical properties, useful for models requiring strong document understanding capabilities.
