C4 (Colossal Clean Crawled Corpus)
Dataset · Free. Google's cleaned Common Crawl corpus used to train T5.
Capabilities (8 decomposed)
large-scale English text corpus filtering and deduplication
Medium confidence. Processes a raw Common Crawl snapshot into roughly 750GB of cleaned English text through a multi-stage heuristic filtering pipeline that removes short pages (threshold-based length filtering), deduplicates by exact matching of repeated three-sentence spans, filters offensive content via keyword matching against a blocklist, and restricts output to English-language documents via automatic language detection. The filtering approach uses rule-based heuristics rather than learned classifiers, making it deterministic and reproducible across dataset versions.
Uses deterministic heuristic-based filtering (length thresholds, keyword matching, language detection) applied at scale to Common Crawl, yielding roughly 750GB of cleaned text and enabling reproducible dataset creation without learned classifiers; includes sentence-level deduplication to remove redundant training examples
More transparent and reproducible than learned filtering approaches; larger and more thoroughly deduplicated than raw Common Crawl, but less sophisticated than newer datasets like FineWeb that use neural classifiers for quality scoring
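A minimal sketch of two of these rules in Python, following the descriptions in the T5 paper: the line-level cleanup (keep only lines ending in terminal punctuation with at least five words) and the langdetect English check at a 0.99 probability threshold. The production pipeline runs at Common Crawl scale and includes further rules (e.g. dropping pages that contain curly braces) not shown here.

```python
from langdetect import detect_langs  # pip install langdetect

def clean_lines(page_text: str) -> str:
    """Line-level C4 heuristics: keep only lines that end in terminal
    punctuation and contain at least five words."""
    kept = []
    for line in page_text.splitlines():
        line = line.strip()
        if not line.endswith((".", "!", "?", '"')):
            continue  # boilerplate (menus, buttons) rarely ends a sentence
        if len(line.split()) < 5:
            continue  # too short to be real prose
        kept.append(line)
    return "\n".join(kept)

def is_english(text: str, threshold: float = 0.99) -> bool:
    """C4 kept pages classified as English with probability >= 0.99."""
    try:
        best = detect_langs(text)[0]  # most probable language first
    except Exception:
        return False  # empty or undetectable text
    return best.lang == "en" and best.prob >= threshold
```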
multilingual corpus variant with 108-language support
Medium confidence. Extends the core English C4 dataset with a multilingual variant covering 108 languages, applying the same heuristic filtering and deduplication pipeline across non-English documents. Language detection and filtering are applied per-language, with separate dataset splits for each language or combined multilingual batches. This enables training of multilingual models on a standardized, cleaned corpus without requiring separate language-specific curation.
Applies consistent heuristic filtering and deduplication across 108 languages using language-agnostic rules, enabling direct comparison of data quality and model performance across languages without language-specific tuning
Broader language coverage than most pre-training datasets; maintains consistency with English C4 filtering, but lacks language-specific quality signals that specialized multilingual datasets (e.g., OSCAR) may include
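For working with a single language, a sketch using the Hugging Face datasets library; the allenai/c4 repository and the multilingual/c4-hi.*.json.gz shard pattern follow its dataset card and should be verified there before use.

```python
from datasets import load_dataset

# Stream one mC4 language (Hindi here) without touching the other 107.
hindi = load_dataset(
    "allenai/c4",
    data_files="multilingual/c4-hi.*.json.gz",  # per-language shard glob
    split="train",
    streaming=True,
)
for example in hindi.take(3):
    print(example["url"], example["text"][:80])
```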
news-domain-specific text variant with distribution matching
Medium confidence. Provides a 'realnewslike' variant of C4 that keeps only documents from the source domains used in the RealNews dataset, enabling training of models on news-domain text without requiring separate news corpus collection. Restricting by publisher domain selects documents that resemble real news content, creating a curated subset suitable for news-focused model training or evaluation.
Filters C4 by news source domain to create a news-distribution-matched subset, enabling news-focused pre-training without separate news corpus collection; maintains consistency with the C4 cleaning pipeline while adding domain-based selection
Simpler and more reproducible than collecting news from multiple sources; smaller and more focused than full C4, but may lack editorial quality and fact-checking standards of professional news datasets
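Loading the news-matched subset only requires a different configuration name; 'realnewslike' sits alongside 'en', 'en.noblocklist', and 'en.noclean' on the Hub repository. A sketch:

```python
from datasets import load_dataset

# Same schema as English C4 (text, timestamp, url), news domains only.
news = load_dataset("allenai/c4", "realnewslike", split="train", streaming=True)
example = next(iter(news))
print(example["url"], example["text"][:120])
```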
Hugging Face dataset streaming and caching integration
Medium confidence. Integrates with Hugging Face's datasets library to enable streaming download, local caching, and efficient batching of C4 data without requiring full dataset download upfront. Uses Apache Arrow format for columnar storage, supports lazy loading and on-demand access to specific splits/languages, and provides built-in caching mechanisms to avoid re-downloading. Integration with Hugging Face Hub enables version control, dataset card documentation, and community contributions.
Native integration with Hugging Face datasets library using Apache Arrow columnar format, enabling efficient streaming, lazy loading, and automatic caching without requiring full dataset materialization; supports version control and community contributions via Hub
More convenient than manual Common Crawl download and processing; streaming capability reduces storage requirements vs. downloading full 750GB; less flexible than raw Common Crawl access but more curated and easier to use
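A sketch of the streaming path: shards are fetched on demand, shuffled through a bounded buffer, and consumed without ever materializing the full 750GB locally.

```python
from datasets import load_dataset

c4 = load_dataset("allenai/c4", "en", split="train", streaming=True)

# Approximate shuffling with a fixed-size buffer; memory use is bounded
# by the buffer, not the corpus size.
c4 = c4.shuffle(seed=42, buffer_size=10_000)

for example in c4.take(5):
    print(example["timestamp"], len(example["text"]))
```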
reproducible dataset versioning and documentation
Medium confidence. Provides versioned dataset snapshots on Hugging Face Hub with detailed documentation (dataset cards, filtering methodology, statistics) enabling reproducible model training and benchmarking. Each version is immutable and tracked, allowing researchers to cite specific dataset versions in papers and reproduce results. Dataset cards include filtering heuristics, language coverage, deduplication statistics, and known limitations, facilitating transparent evaluation and comparison.
Provides immutable, versioned dataset snapshots with comprehensive documentation on Hugging Face Hub, enabling persistent citation and reproducible research; includes detailed dataset cards describing filtering methodology and known limitations
More reproducible than raw Common Crawl access; better documented than most pre-training datasets; enables long-term research reproducibility through version control, but requires Hugging Face Hub infrastructure
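In practice, reproducibility means pinning the Hub revision when loading; the revision string below is a placeholder to replace with a real commit hash or tag from the repository's history.

```python
from datasets import load_dataset

c4 = load_dataset(
    "allenai/c4",
    "en",
    revision="commit-hash-or-tag",  # placeholder: copy from the Hub commit log
    split="train",
    streaming=True,
)
```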
sentence-level deduplication at scale
Medium confidence. Implements sentence-level deduplication across the 750GB corpus by exact matching of repeated three-sentence spans, identifying and removing duplicate text within and across documents. This reduces redundancy in training data, improving model training efficiency and reducing overfitting to repeated patterns. Deduplication is applied during dataset construction, not at inference time, creating a cleaner training corpus without duplicated examples.
Applies sentence-level deduplication at scale across 750GB using deterministic exact-match techniques, removing redundant training examples while maintaining document structure; enables cleaner training data without requiring learned quality models
More thorough than document-level deduplication; simpler and more reproducible than semantic deduplication approaches; reduces training data size but may miss near-duplicates that learned methods would catch
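A toy sketch of the exact-match idea: hash every three-sentence span, keep its first occurrence corpus-wide, and drop sentences that open a span seen before. The real pipeline runs this as a distributed job with proper sentence segmentation rather than the naive '. ' split used here.

```python
import hashlib

def dedup_three_sentence_spans(pages: list[str]) -> list[str]:
    """Keep the first occurrence of every three-sentence span across
    the corpus; later occurrences are dropped."""
    seen: set[bytes] = set()
    cleaned = []
    for page in pages:
        sentences = [s.strip() for s in page.split(". ") if s.strip()]
        kept = []
        for i, sentence in enumerate(sentences):
            span = " ".join(sentences[i:i + 3]).lower()
            digest = hashlib.sha1(span.encode("utf-8")).digest()
            if digest in seen:
                continue  # sentence opens a span already emitted elsewhere
            seen.add(digest)
            kept.append(sentence)
        cleaned.append(". ".join(kept))
    return cleaned
```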
offensive content filtering via heuristic rules
Medium confidence. Filters offensive, inappropriate, or harmful content from C4 using keyword matching against a published blocklist (the 'List of Dirty, Naughty, Obscene or Otherwise Bad Words'), applied during dataset construction. This creates a cleaner training corpus less likely to produce offensive model outputs, though heuristic filtering is inherently imperfect and may miss context-dependent offensiveness or allow some harmful content through.
Uses deterministic heuristic rules (keyword matching, pattern-based filtering) to remove offensive content at scale, enabling reproducible and transparent filtering without learned classifiers; applied during dataset construction rather than at inference time
More transparent and reproducible than learned filtering approaches; simpler to implement and audit than neural classifiers; less sophisticated than context-aware filtering but faster and more deterministic
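A sketch of the page-level blocklist check; C4's actual list is the published 'List of Dirty, Naughty, Obscene or Otherwise Bad Words', and the two entries below are placeholders.

```python
BAD_WORDS = {"badword1", "badword2"}  # placeholders, not the real blocklist

def passes_blocklist(page_text: str) -> bool:
    """Drop the entire page if any blocklisted token appears. The real
    filter is equally blunt, which is why the Hub also hosts an
    'en.noblocklist' variant without this step."""
    tokens = set(page_text.lower().split())
    return tokens.isdisjoint(BAD_WORDS)
```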
short-document filtering with length-based heuristics
Medium confidence. Removes pages that contain too little text: the published C4 rules drop pages with fewer than three sentences and discard lines with fewer than five words, filtering out low-quality, stub, or boilerplate content. This filtering is applied during corpus curation and reduces the proportion of short, low-information-density documents in the training corpus. The approach is simple and transparent but may remove legitimate short-form content like abstracts, summaries, or social media posts.
Uses simple, transparent length-based filtering (a minimum of three sentences per page and five words per line) to remove low-quality stub content, making the filtering auditable and reproducible; most alternative corpora use more complex quality heuristics
Simpler and more transparent than learned quality classifiers, but less effective at identifying low-quality content that is not simply short
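The page-level check from the published rules, as a sketch; crude '.'-splitting stands in for real sentence segmentation.

```python
def passes_length_filter(page_text: str, min_sentences: int = 3) -> bool:
    """Keep a page only if it has at least three sentences, per the
    T5 paper; anything shorter is treated as a stub."""
    sentences = [s for s in page_text.split(".") if s.strip()]
    return len(sentences) >= min_sentences
```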
Capabilities are decomposed by AI analysis. Each maps to specific user intents and improves with match feedback.
Related Artifacts (sharing capabilities)
Artifacts that share capabilities with C4 (Colossal Clean Crawled Corpus), ranked by overlap. Discovered automatically through the match graph.
mC4
Multilingual web corpus covering 101 languages.
OPUS
Massive parallel corpus for machine translation.
CulturaX
6.3T token multilingual dataset across 167 languages.
RedPajama v2
30 trillion token web dataset with 40+ quality signals per document.
fineweb
Dataset by HuggingFaceFW. 643,166 downloads.
FineFineWeb
Dataset by m-a-p. 459,057 downloads.
Best For
- ✓Research teams training foundational LLMs and needing a reproducible baseline dataset
- ✓Organizations benchmarking model performance against T5-era standards
- ✓Researchers studying data quality and filtering effects on model behavior
- ✓Multilingual model developers needing balanced, cleaned data across many languages
- ✓Researchers studying cross-lingual transfer and language-specific biases
- ✓Teams building models for low-resource languages using high-resource language data
- ✓News organizations and media companies training domain-specific models
- ✓Researchers studying news bias, misinformation, and domain-specific language patterns
Known Limitations
- ⚠Heuristic-based filtering may miss nuanced offensive content or allow some low-quality text through
- ⚠750GB dataset size requires significant storage and bandwidth for download
- ⚠English-only variant excludes non-English speakers; multilingual variant adds complexity
- ⚠Sentence-level deduplication may not catch semantic duplicates or near-duplicates
- ⚠Dataset is static and not updated; it was built from a single Common Crawl snapshot (April 2019), so newer web content is not included
- ⚠Language detection errors may misclassify documents, especially for similar languages or code-mixed text
UnfragileRank
UnfragileRank is computed from adoption signals, documentation quality, ecosystem connectivity, match graph feedback, and freshness. No artifact can pay for a higher rank.
About
Google's cleaned version of Common Crawl used to train the original T5 model. 750GB of English text filtered with heuristic rules: removed short pages, deduped sentences, filtered offensive content, and restricted to English. Despite being superseded by newer datasets, C4 remains one of the most studied and benchmarked pre-training datasets. Available in English, multilingual (108 languages), and realnewslike variants on Hugging Face.
Categories
Alternatives to C4 (Colossal Clean Crawled Corpus)