The Pile
Dataset · Free. EleutherAI's 825 GiB diverse training dataset from 22 sources.
Capabilities: 11 decomposed
multi-domain pretraining corpus assembly
Medium confidence. Aggregates 22 discrete, high-quality English text datasets (academic papers, books, code, web text, specialized sources) into a unified 825 GiB jsonlines corpus compressed with zstandard. The assembly approach combines heterogeneous sources without documented deduplication or cross-domain filtering, enabling language models to learn from diverse knowledge domains in a single training pass. Data is stored as line-delimited JSON objects, one document per line, allowing streaming consumption by tokenizers and dataloaders without full decompression.
Combines 22 diverse, independently-curated datasets (academic, books, code, web, specialized) into a single unified corpus without applying documented deduplication or cross-domain filtering, preserving domain-specific characteristics while enabling broad knowledge coverage in a single training pass. This heterogeneous assembly approach contrasts with single-domain datasets (e.g., Books3 alone) or heavily preprocessed corpora that normalize domain distributions.
Broader domain coverage than Common Crawl alone or academic-only datasets; larger and more diverse than earlier open datasets like WikiText or BookCorpus, enabling models trained on Pile to generalize across code, patents, IRC, and academic papers simultaneously.
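A minimal sketch of the per-line record layout, assuming the conventional field names (a text string plus a meta object whose pile_set_name identifies the component subset); the example document is illustrative, not taken from the corpus:

```python
import json

# One document per line; the meta field typically names the source subset.
line = '{"text": "Attention is a mechanism ...", "meta": {"pile_set_name": "ArXiv"}}'

doc = json.loads(line)
print(doc["meta"]["pile_set_name"])  # which of the 22 subsets this came from
print(len(doc["text"]))              # document length in characters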
cross-domain model evaluation via pile bpb metric
Medium confidence. Provides a standardized evaluation benchmark (Pile Bits Per Byte / BPB) that measures language model perplexity across the full 22-domain corpus, enabling comparison of model generalization performance on diverse text types. The metric aggregates per-domain loss into a single scalar, with a public leaderboard tracking zero-shot performance of models trained on Pile and other datasets. Evaluation code is available but not fully documented in the artifact description.
Aggregates loss across 22 heterogeneous domains into a single BPB metric, enabling cross-domain generalization evaluation without requiring per-domain breakdowns. This contrasts with single-domain benchmarks (e.g., LAMBADA, WikiText) or multi-benchmark suites (GLUE, SuperGLUE) that require separate evaluation runs. The leaderboard provides public tracking of model performance, creating a shared reference point for open-source LLM development.
More comprehensive than single-domain perplexity metrics (e.g., WikiText-103 alone) because it measures generalization across code, patents, IRC, and academic papers simultaneously; simpler than multi-benchmark evaluation suites (GLUE, SuperGLUE) that require separate task-specific evaluations.
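As a concrete sketch of the conversion (following the standard definition of bits per byte; the function name and sample numbers below are illustrative, not from the artifact):

```python
import math

def bits_per_byte(total_nll_nats: float, total_utf8_bytes: int) -> float:
    """Convert a summed token-level negative log-likelihood (natural log)
    into bits per byte over the evaluated text."""
    return total_nll_nats / (math.log(2) * total_utf8_bytes)

# e.g. 1M tokens at a mean loss of 2.6 nats over 4.3M bytes of raw text
print(bits_per_byte(2.6 * 1_000_000, 4_300_000))  # ~0.87 BPB
```

Because the denominator is raw bytes rather than tokens, BPB remains comparable across models with different tokenizers, which is what makes it usable as a cross-model leaderboard metric.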
model-agnostic training data format and integration
Medium confidence. Provides training data in a model-agnostic jsonlines format that integrates with standard ML frameworks (PyTorch, TensorFlow, Hugging Face) without requiring custom preprocessing or format conversion. The jsonlines + zstandard approach enables seamless integration with existing dataloaders, tokenizers, and training pipelines, reducing friction for researchers adopting the dataset. No custom APIs or proprietary tools are required — standard open-source libraries suffice.
Uses a standard, framework-agnostic jsonlines + zstandard format that integrates directly with PyTorch, TensorFlow, and Hugging Face without custom preprocessing or proprietary tools. This contrasts with binary container formats (HDF5, custom serialization) that require dedicated loaders, or single-framework datasets that lock users into specific ML libraries.
More portable than proprietary formats because it uses standard jsonlines; more efficient than uncompressed text because zstandard compression reduces storage by ~3-4x; simpler than columnar or database formats (Parquet, SQLite) because jsonlines requires no schema definition or query engine.
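A minimal integration sketch, assuming the zstandard and torch packages and a locally downloaded shard (the file name below is a placeholder); it streams documents straight off the compressed file into a standard PyTorch dataloader:

```python
import io
import json

import torch
import zstandard as zstd


class PileShard(torch.utils.data.IterableDataset):
    """Stream documents from one zstd-compressed jsonlines shard."""

    def __init__(self, path: str):
        self.path = path

    def __iter__(self):
        with open(self.path, "rb") as fh:
            # Stream-decompress: the full shard is never held in memory.
            reader = zstd.ZstdDecompressor().stream_reader(fh)
            for line in io.TextIOWrapper(reader, encoding="utf-8"):
                yield json.loads(line)["text"]


loader = torch.utils.data.DataLoader(PileShard("00.jsonl.zst"), batch_size=None)
```

Tokenization and batching would normally be layered on top of this iterator; the same pattern can feed TensorFlow via tf.data.Dataset.from_generator or a Hugging Face datasets pipeline.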
component dataset composition and sourcing
Medium confidence. Curates and integrates 22 distinct text sources spanning academic (PubMed, ArXiv), books (Books3, Project Gutenberg), code (GitHub), web (OpenWebText2, Pile-CC), and specialized domains (USPTO patents, Ubuntu IRC, Stack Exchange, and others). Each component is sourced independently with its own collection methodology, licensing, and quality standards, then combined into a single corpus. The exact composition percentages, preprocessing applied per component, and license terms for individual datasets are not documented.
Combines 22 independently-sourced datasets (academic APIs, web crawls, code repositories, specialized archives) into a single corpus without documented composition percentages or per-component preprocessing. This 'black-box' curation approach enables broad coverage but obscures which domains drive model behavior. Contrasts with single-source datasets (e.g., Common Crawl alone) or fully documented pipelines (e.g., C4 with explicit filtering rules).
More diverse than single-source datasets (Common Crawl, Books3) because it includes code, patents, IRC, and academic papers; more opaque than documented datasets like C4 because composition percentages and preprocessing per component are not published.
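Because the composition is not published, it can be recovered empirically by streaming the shards and tallying the subset label in each document's metadata. A minimal sketch, assuming locally downloaded shards (the paths are placeholders) and the conventional pile_set_name meta field:

```python
import glob
import io
import json
from collections import Counter

import zstandard as zstd

counts = Counter()
for path in glob.glob("pile/train/*.jsonl.zst"):
    with open(path, "rb") as fh:
        reader = zstd.ZstdDecompressor().stream_reader(fh)
        for line in io.TextIOWrapper(reader, encoding="utf-8"):
            counts[json.loads(line)["meta"]["pile_set_name"]] += 1

total = sum(counts.values())
for name, n in counts.most_common():
    print(f"{name:20s} {n:10d} {n / total:6.2%}")
```

Note that document counts differ from byte or token shares; a byte-weighted tally would be needed to estimate the effective domain weighting seen during training.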
jsonlines format streaming and decompression
Medium confidence. Stores the 825 GiB corpus as line-delimited JSON objects (jsonlines format) compressed with zstandard (zst), enabling efficient streaming consumption without full decompression. Each line is a complete JSON object (typically {"text": "...", "meta": {...}}), allowing dataloaders to read and tokenize documents sequentially without loading the entire corpus into memory. Zstandard compression provides a ~3-4x compression ratio while maintaining fast decompression speeds suitable for training pipelines.
Uses jsonlines + zstandard compression to enable streaming consumption without full decompression, allowing training pipelines to read documents sequentially from disk. This contrasts with monolithic formats (single large tar.gz) that require full decompression before use, or uncompressed jsonlines that consume 825 GiB of disk space. The combination optimizes for both storage efficiency (~3-4x compression) and streaming speed (fast zstandard decompression).
More efficient than uncompressed jsonlines (saves ~500 GiB of disk space) and faster to decompress than gzip or bzip2; less random-access-friendly than columnar or database formats (Parquet, SQLite) but simpler to distribute and parse.
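For tools that cannot read .zst directly, a shard can also be fully decompressed once up front. A minimal sketch, assuming the zstandard package and placeholder file names (streaming consumption, as above, avoids this step and the extra disk usage):

```python
import zstandard as zstd

# One-off decompression of a single shard to plain jsonlines.
with open("00.jsonl.zst", "rb") as src, open("00.jsonl", "wb") as dst:
    zstd.ZstdDecompressor().copy_stream(src, dst)
```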
academic and scientific text sourcing (pubmed, arxiv)
Medium confidence. Includes curated academic and scientific text from PubMed (biomedical literature abstracts and full texts) and ArXiv (preprints in physics, mathematics, computer science, and related fields). These components provide domain-specific vocabulary, citation patterns, and technical knowledge that enable models to understand scientific writing and reasoning. The exact filtering criteria, date ranges, and preprocessing applied to PubMed and ArXiv are not documented.
Integrates two major academic sources (PubMed for biomedical literature, ArXiv for physics/math/CS preprints) into a single corpus, providing models with exposure to both established scientific knowledge and cutting-edge research. This contrasts with web-only datasets (Common Crawl) that underrepresent academic writing, or single-discipline academic corpora with narrower coverage.
Broader academic coverage than single-discipline corpora because it spans biomedical literature (PubMed) alongside physics, mathematics, and computer science preprints (ArXiv); more comprehensive than web-only datasets because it captures peer-reviewed and preprint literature with technical depth.
code and software repository sourcing (github)
Medium confidence. Includes source code from GitHub repositories, providing models with exposure to programming languages, software patterns, and code documentation. The GitHub component enables models to learn code syntax, function signatures, and common programming idioms across multiple languages. Exact filtering criteria (e.g., license types, repository size, programming languages included) and preprocessing (e.g., comment removal, tokenization) are not documented.
Integrates real-world GitHub source code into a general-purpose pretraining corpus, enabling models trained on Pile to learn code patterns alongside natural language. This contrasts with code-only datasets (CodeSearchNet, GitHub-Code) or natural-language-only datasets (Common Crawl) that separate code and text. The inclusion of code in a general corpus enables models to understand code-in-context (e.g., code in documentation, code comments).
Broader than code-only datasets because it includes code alongside natural language documentation and comments; more comprehensive than web-only datasets because it captures real-world software patterns from production repositories.
web text sourcing (openwebtext2, pile-cc)
Medium confidence. Includes web-crawled text from OpenWebText2 (EleutherAI's extension of OpenWebText, an open replication of the WebText corpus used to train GPT-2) and Pile-CC (a filtered subset of Common Crawl). These components provide diverse, naturally-occurring text from the internet, including news, blogs, forums, and general web content. The filtering criteria, quality thresholds, and deduplication methodology for web sources are not documented.
Combines two web-crawled sources (OpenWebText2, built from pages linked in Reddit submissions, and Pile-CC, a filtered subset of Common Crawl) into a single corpus, providing models with diverse, naturally-occurring web text. This contrasts with academic-only datasets or single-source web datasets, enabling models to learn from both curated and web-scale text simultaneously.
More diverse than single-source web datasets (Common Crawl alone) because it pairs Reddit-curated OpenWebText2 pages with filtered Common Crawl; more comprehensive than academic-only datasets because it captures real-world language use from millions of web pages.
specialized domain sourcing (uspto, irc, stack exchange)
Medium confidence. Includes text from specialized domains: USPTO patents (technical descriptions and claims), Ubuntu IRC (real-time chat and technical support discussions), and Stack Exchange (Q&A across programming, science, and general knowledge). These components provide domain-specific vocabulary, problem-solving patterns, and technical reasoning that enable models to understand specialized contexts. Exact filtering and preprocessing for each specialized source are not documented.
Integrates three specialized, non-traditional text sources (patents, IRC chat, Q&A) into a general-purpose pretraining corpus, enabling models to learn technical reasoning and problem-solving patterns alongside general language. This contrasts with academic-only or web-only datasets that underrepresent specialized domains, or single-domain specialized datasets (e.g., patent-only corpora).
More diverse than single-domain specialized datasets because it includes patents, chat, and Q&A simultaneously; more comprehensive than general-purpose datasets because it captures real-world technical problem-solving and specialized vocabulary.
book and literary text sourcing (books3, project gutenberg)
Medium confidence. Includes long-form literary and non-fiction text from Books3 (a large collection of books) and Project Gutenberg (public-domain books). These components provide models with exposure to narrative structure, literary language, and long-range dependencies that enable understanding of complex, multi-paragraph text. Exact filtering criteria, copyright status, and preprocessing for book sources are not documented.
Combines two book sources (Books3 for breadth, Project Gutenberg for public-domain reliability) into a general-purpose corpus, enabling models to learn narrative structure and literary language alongside other domains. This contrasts with web-only datasets that underrepresent long-form narrative, or book-only datasets that lack diversity in other domains.
More comprehensive than web-only datasets because it includes long-form narrative and literary language; more diverse than book-only datasets because it includes code, academic papers, and web text alongside books.
public reproducibility and open-source model training
Medium confidence. Enables reproducible, open-source language model training by providing a publicly-available, freely-downloadable dataset used to train GPT-NeoX, Pythia, and other open models. The dataset is released under an open license (exact license terms not specified in the artifact), allowing researchers and organizations to train models with full transparency and reproducibility. The Pile has influenced the design of subsequent open datasets, establishing a standard for open-source LLM training data.
Provides a large-scale, publicly-available, freely-downloadable pretraining dataset specifically designed for open-source LLM development, enabling full reproducibility and transparency. This contrasts with proprietary datasets (used by OpenAI, Google, Meta) that are not publicly available, or academic datasets that lack the scale and diversity needed for large models. The Pile's influence on subsequent open datasets (e.g., RedPajama, Dolma) establishes it as a foundational artifact for open-source AI.
More accessible than proprietary datasets (OpenAI, Google) because it is freely available; more comprehensive than earlier open datasets (WikiText, BookCorpus) because it includes 825 GiB across 22 domains; more influential than contemporary datasets because it established design patterns for open-source LLM training data.
Capabilities are decomposed by AI analysis. Each maps to specific user intents and improves with match feedback.
Related Artifacts sharing capabilities
Artifacts that share capabilities with The Pile, ranked by overlap. Discovered automatically through the match graph.
TxT360
Dataset by LLM360. 490,092 downloads.
Dolma
Allen AI's 3T token dataset for fully reproducible LLM training.
TinyLlama
1.1B model pre-trained on 3T tokens for edge use.
RedPajama v2
30 trillion token web dataset with 40+ quality signals per document.
Language Is Not All You Need: Aligning Perception with Language Models (Kosmos-1)
Microsoft's multimodal large language model that aligns perception with language.
LitGPT
Lightning AI's LLM library — pretrain, fine-tune, deploy with clean PyTorch Lightning code.
Best For
- ✓Research teams training open-source large language models (100M+ parameters)
- ✓Organizations building foundation models with public reproducibility requirements
- ✓Researchers studying multi-domain text modeling and transfer learning
- ✓Researchers evaluating language models trained on Pile or other datasets
- ✓Teams benchmarking model generalization across heterogeneous text domains
- ✓Open-source LLM developers seeking standardized evaluation beyond perplexity on single domains
- ✓ML engineers building training pipelines with PyTorch, TensorFlow, or Hugging Face
- ✓Teams seeking to minimize data engineering overhead when adopting large-scale pretraining datasets
Known Limitations
- ⚠English-only; no multilingual or non-English language support
- ⚠Fixed snapshot (825 GiB as of 2020 publication); no documented versioning or update cadence
- ⚠No documented deduplication methodology — potential for duplicate documents across component datasets
- ⚠Composition percentages of 22 component datasets unknown — cannot optimize domain weighting
- ⚠825 GiB storage requirement; no streaming download or partial corpus guidance documented
- ⚠Single scalar metric (BPB) masks per-domain performance variation — cannot diagnose which domains a model struggles with
Requirements
Input / Output
UnfragileRank
UnfragileRank is computed from adoption signals, documentation quality, ecosystem connectivity, match graph feedback, and freshness. No artifact can pay for a higher rank.
About
EleutherAI's seminal 825 GiB English text dataset composed of 22 diverse high-quality subsets. Includes academic papers (PubMed, ArXiv), books (Books3, Gutenberg), code (GitHub), web (OpenWebText2, Pile-CC), and specialized sources (USPTO patents, Ubuntu IRC, Stack Exchange). Designed for training large language models with broad knowledge coverage. Used to train GPT-NeoX, Pythia, and influenced the design of virtually every subsequent open training dataset.
Categories
Alternatives to The Pile
Hugging Face Hub — The GitHub for AI: 500K+ models, datasets, Spaces, Inference API, hub for open-source AI.
Compare →
Are you the builder of The Pile?
Claim this artifact to get a verified badge, access match analytics, see which intents users search for, and manage your listing.
Data Sources