Multi Source Pretraining Data Composition With Documented Curation Rules

1

The Pile · Dataset · 60/100

via “multi-domain pretraining corpus assembly”

EleutherAI's 825 GiB diverse training dataset from 22 sources.

Unique: Pioneered the multi-domain curation approach by intentionally combining 22 diverse, high-quality subsets (academic papers, books, code, web text, and specialized sources) rather than scraping a single massive web corpus. This design choice prioritizes knowledge breadth and domain coverage over raw scale, and it influenced the design of later open text datasets such as RedPajama and Falcon RefinedWeb.

vs others: Broader domain coverage than Common Crawl-only datasets (e.g., C4) and higher quality than raw web scrapes, thanks to its curated academic, code, and book sources; smaller than Falcon RefinedWeb (1.5T tokens) but more carefully curated and widely adopted as a benchmark for model evaluation.
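As a rough illustration of what assembling a multi-domain corpus involves, the sketch below turns per-subset sizes and per-subset repetition counts into sampling weights. The subset names, sizes, and epoch counts are hypothetical placeholders, not The Pile's actual figures, and `mixture_weights` is an illustrative helper rather than part of any released tooling.

```python
def mixture_weights(sizes_gib, epochs):
    """Return the share of the final mix contributed by each subset.

    sizes_gib: raw size of each subset (hypothetical values).
    epochs: how many times each subset is repeated; defaults to 1.0.
    """
    effective = {
        name: size * epochs.get(name, 1.0) for name, size in sizes_gib.items()
    }
    total = sum(effective.values())
    return {name: value / total for name, value in effective.items()}


# Hypothetical subsets: small, high-quality sources are repeated twice.
sizes = {"web": 200.0, "papers": 100.0, "code": 50.0, "books": 50.0}
epochs = {"papers": 2.0, "books": 2.0}
weights = mixture_weights(sizes, epochs)

for name, weight in sorted(weights.items(), key=lambda item: -item[1]):
    print(f"{name:8s} {weight:.3f}")
```

Repeating the smaller curated subsets raises their share of the mix without discarding web text, which is one way to favor domain coverage over raw scale.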

2

Dolma · Dataset · 59/100

via “multi-source pretraining data composition with documented curation rules”

Allen AI's 3T token dataset for fully reproducible LLM training.

Unique: Dolma's distinguishing feature is comprehensive documentation of its data curation decisions (exact filtering rules, deduplication methods via Duplodocus, and mixing ratios), released alongside the trained models (OLMo 7B, 32B) so that training can be fully reproduced. Most pretraining datasets (C4, The Pile, ROOTS) document their composition at a high level but not the specific algorithmic rules applied. Dolma's integration with OLMoTrace also allows model outputs to be traced back to source training documents, a form of data provenance that most datasets lack.

vs others: Dolma provides greater transparency and reproducibility than C4 or The Pile through documented filtering rules and deduplication specifications, and it offers more diverse source coverage (code, academic, and literary text) than web-only datasets like C4. Its size is not directly comparable to that of ROOTS, which is reported as 1.6 TB of text rather than in tokens (versus Dolma's 3T tokens), and it is updated less frequently than continuously refreshed web crawl datasets.
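To make "documented curation rules" concrete, here is a minimal sketch of a filter-and-deduplicate pass in which every decision is driven by an explicit, recorded rule. The thresholds, the `Rule` class, and the `curate` function are all hypothetical; this is not Dolma's actual rule set, and it does not use the Duplodocus API. Deduplication here is simple exact matching on whitespace- and case-normalized text.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    """A declarative, publishable description of the filter thresholds."""
    min_words: int
    max_symbol_ratio: float


def symbol_ratio(text):
    """Fraction of characters that are neither alphanumeric nor whitespace."""
    if not text:
        return 1.0
    symbols = sum(1 for ch in text if not (ch.isalnum() or ch.isspace()))
    return symbols / len(text)


def curate(docs, rule):
    """Apply the rule and exact deduplication; return kept docs and a decision log."""
    seen = set()
    kept, log = [], []
    for doc in docs:
        normalized = " ".join(doc.split()).lower()
        digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if len(doc.split()) < rule.min_words:
            log.append((digest[:8], "too_short"))
        elif symbol_ratio(doc) > rule.max_symbol_ratio:
            log.append((digest[:8], "too_many_symbols"))
        elif digest in seen:
            log.append((digest[:8], "duplicate"))
        else:
            seen.add(digest)
            kept.append(doc)
            log.append((digest[:8], "kept"))
    return kept, log


docs = [
    "The quick brown fox jumps over the lazy dog",
    "the quick  brown fox jumps over the lazy dog",
    "too short",
    "#### $$$$ %%%% ^^^^ &&&& **** (((( )))) !!!!",
]
rule = Rule(min_words=5, max_symbol_ratio=0.3)
kept, log = curate(docs, rule)
reasons = [reason for _, reason in log]
print(reasons)
```

Publishing the `Rule` values together with the per-document log is what lets a third party rerun the pipeline and check that they get the same corpus.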

3

FLAN Collection · Dataset · 57/100

via “cross-domain task composition and sampling”

Google's 1,836-task instruction mixture for broad generalization.

Unique: Explicitly tracks and balances task representation across four heterogeneous source datasets and multiple semantic domains, using principled sampling to prevent any single source or domain from dominating training. This is more sophisticated than simple concatenation and enables reproducible, analyzable task composition.

vs others: More balanced and analytically transparent than ad-hoc dataset combinations. Its explicit domain and source tracking supports ablation studies and reproducible training recipes, which other instruction datasets generally lack.
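One way to keep a single source from dominating a mixture is examples-proportional sampling with a per-task cap. The sketch below illustrates that general technique and is not the FLAN Collection's published recipe; the task names, sizes, and cap value are hypothetical.

```python
import random


def capped_mixing_rates(task_sizes, cap):
    """Sampling rate per task, proportional to min(task size, cap)."""
    clipped = {task: min(size, cap) for task, size in task_sizes.items()}
    total = sum(clipped.values())
    return {task: count / total for task, count in clipped.items()}


def sample_tasks(rates, k, seed=0):
    """Draw k task names according to the mixing rates (seeded for repeatability)."""
    rng = random.Random(seed)
    names = list(rates)
    return rng.choices(names, weights=[rates[name] for name in names], k=k)


# Hypothetical task sizes: one task is far larger than the others.
task_sizes = {"translation": 1_000_000, "qa": 50_000, "summarization": 10_000}
rates = capped_mixing_rates(task_sizes, cap=30_000)
batch = sample_tasks(rates, k=8)

print(rates)
print(batch)
```

Without the cap, the largest task would supply roughly 94% of sampled examples; with the cap at 30,000 it is held to the same rate as the mid-sized task, and the fixed seed makes the draw repeatable.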
