C4 (Colossal Clean Crawled Corpus)
Dataset · Free. Google's cleaned Common Crawl corpus used to train T5.
Capabilities (8 decomposed)
large-scale English text corpus filtering and deduplication
Medium confidence. Processes a raw Common Crawl snapshot into roughly 750GB of cleaned English text through a multi-stage heuristic filtering pipeline that removes short pages (threshold-based length filtering), deduplicates by exact matching of repeated three-sentence spans, filters offensive content via keyword matching against a blocklist, and restricts output to English-language documents via automatic language detection. The filtering approach uses rule-based heuristics rather than learned classifiers, making it deterministic and reproducible across dataset versions.
Uses deterministic heuristic-based filtering (length thresholds, keyword matching, language detection) applied at scale to Common Crawl, yielding roughly 750GB of cleaned text and enabling reproducible dataset creation without learned classifiers; includes sentence-level deduplication to remove redundant training examples
More transparent and reproducible than learned filtering approaches; larger and more thoroughly deduplicated than raw Common Crawl, but less sophisticated than newer datasets like FineWeb that use neural classifiers for quality scoring
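A minimal sketch of two of these rules in Python, following the descriptions in the T5 paper: the line-level cleanup (keep only lines ending in terminal punctuation with at least five words) and the langdetect English check at a 0.99 probability threshold. The production pipeline runs at Common Crawl scale and includes further rules (e.g. dropping pages that contain curly braces) not shown here.

```python
from langdetect import detect_langs  # pip install langdetect

def clean_lines(page_text: str) -> str:
    """Line-level C4 heuristics: keep only lines that end in terminal
    punctuation and contain at least five words."""
    kept = []
    for line in page_text.splitlines():
        line = line.strip()
        if not line.endswith((".", "!", "?", '"')):
            continue  # boilerplate (menus, buttons) rarely ends a sentence
        if len(line.split()) < 5:
            continue  # too short to be real prose
        kept.append(line)
    return "\n".join(kept)

def is_english(text: str, threshold: float = 0.99) -> bool:
    """C4 kept pages classified as English with probability >= 0.99."""
    try:
        best = detect_langs(text)[0]  # most probable language first
    except Exception:
        return False  # empty or undetectable text
    return best.lang == "en" and best.prob >= threshold
```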
multilingual corpus variant with 108-language support
Medium confidence. Extends the core English C4 dataset with a multilingual variant covering 108 languages, applying the same heuristic filtering and deduplication pipeline across non-English documents. Language detection and filtering are applied per-language, with separate dataset splits for each language or combined multilingual batches. This enables training of multilingual models on a standardized, cleaned corpus without requiring separate language-specific curation.
Applies consistent heuristic filtering and deduplication across 108 languages using language-agnostic rules, enabling direct comparison of data quality and model performance across languages without language-specific tuning
Broader language coverage than most pre-training datasets; maintains consistency with English C4 filtering, but lacks language-specific quality signals that specialized multilingual datasets (e.g., OSCAR) may include
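For working with a single language, a sketch using the Hugging Face datasets library; the allenai/c4 repository and the multilingual/c4-hi.*.json.gz shard pattern follow its dataset card and should be verified there before use.

```python
from datasets import load_dataset

# Stream one mC4 language (Hindi here) without touching the other 107.
hindi = load_dataset(
    "allenai/c4",
    data_files="multilingual/c4-hi.*.json.gz",  # per-language shard glob
    split="train",
    streaming=True,
)
for example in hindi.take(3):
    print(example["url"], example["text"][:80])
```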
news-domain-specific text variant with distribution matching
Medium confidence. Provides a 'realnewslike' variant of C4 that keeps only documents from the source domains used in the RealNews dataset, enabling training of models on news-domain text without requiring separate news corpus collection. Restricting by publisher domain selects documents that resemble real news content, creating a curated subset suitable for news-focused model training or evaluation.
Filters C4 by news source domain to create a news-distribution-matched subset, enabling news-focused pre-training without separate news corpus collection; maintains consistency with the C4 cleaning pipeline while adding domain-based selection
Simpler and more reproducible than collecting news from multiple sources; smaller and more focused than full C4, but may lack editorial quality and fact-checking standards of professional news datasets
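Loading the news-matched subset only requires a different configuration name; 'realnewslike' sits alongside 'en', 'en.noblocklist', and 'en.noclean' on the Hub repository. A sketch:

```python
from datasets import load_dataset

# Same schema as English C4 (text, timestamp, url), news domains only.
news = load_dataset("allenai/c4", "realnewslike", split="train", streaming=True)
example = next(iter(news))
print(example["url"], example["text"][:120])
```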
Hugging Face dataset streaming and caching integration
Medium confidence. Integrates with Hugging Face's datasets library to enable streaming download, local caching, and efficient batching of C4 data without requiring full dataset download upfront. Uses Apache Arrow format for columnar storage, supports lazy loading and on-demand access to specific splits/languages, and provides built-in caching mechanisms to avoid re-downloading. Integration with Hugging Face Hub enables version control, dataset card documentation, and community contributions.
Native integration with Hugging Face datasets library using Apache Arrow columnar format, enabling efficient streaming, lazy loading, and automatic caching without requiring full dataset materialization; supports version control and community contributions via Hub
More convenient than manual Common Crawl download and processing; streaming capability reduces storage requirements vs. downloading full 750GB; less flexible than raw Common Crawl access but more curated and easier to use
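A sketch of the streaming path: shards are fetched on demand, shuffled through a bounded buffer, and consumed without ever materializing the full 750GB locally.

```python
from datasets import load_dataset

c4 = load_dataset("allenai/c4", "en", split="train", streaming=True)

# Approximate shuffling with a fixed-size buffer; memory use is bounded
# by the buffer, not the corpus size.
c4 = c4.shuffle(seed=42, buffer_size=10_000)

for example in c4.take(5):
    print(example["timestamp"], len(example["text"]))
```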
reproducible dataset versioning and documentation
Medium confidence. Provides versioned dataset snapshots on Hugging Face Hub with detailed documentation (dataset cards, filtering methodology, statistics) enabling reproducible model training and benchmarking. Each version is immutable and tracked, allowing researchers to cite specific dataset versions in papers and reproduce results. Dataset cards include filtering heuristics, language coverage, deduplication statistics, and known limitations, facilitating transparent evaluation and comparison.
Provides immutable, versioned dataset snapshots with comprehensive documentation on Hugging Face Hub, enabling persistent citation and reproducible research; includes detailed dataset cards describing filtering methodology and known limitations
More reproducible than raw Common Crawl access; better documented than most pre-training datasets; enables long-term research reproducibility through version control, but requires Hugging Face Hub infrastructure
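In practice, reproducibility means pinning the Hub revision when loading; the revision string below is a placeholder to replace with a real commit hash or tag from the repository's history.

```python
from datasets import load_dataset

c4 = load_dataset(
    "allenai/c4",
    "en",
    revision="commit-hash-or-tag",  # placeholder: copy from the Hub commit log
    split="train",
    streaming=True,
)
```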
sentence-level deduplication at scale
Medium confidence. Implements sentence-level deduplication across the 750GB corpus by exact matching of repeated three-sentence spans, identifying and removing duplicate text within and across documents. This reduces redundancy in training data, improving model training efficiency and reducing overfitting to repeated patterns. Deduplication is applied during dataset construction, not at inference time, creating a cleaner training corpus without duplicated examples.
Applies sentence-level deduplication at scale across 750GB using deterministic exact-match techniques, removing redundant training examples while maintaining document structure; enables cleaner training data without requiring learned quality models
More thorough than document-level deduplication; simpler and more reproducible than semantic deduplication approaches; reduces training data size but may miss near-duplicates that learned methods would catch
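A toy sketch of the exact-match idea: hash every three-sentence span, keep its first occurrence corpus-wide, and drop sentences that open a span seen before. The real pipeline runs this as a distributed job with proper sentence segmentation rather than the naive '. ' split used here.

```python
import hashlib

def dedup_three_sentence_spans(pages: list[str]) -> list[str]:
    """Keep the first occurrence of every three-sentence span across
    the corpus; later occurrences are dropped."""
    seen: set[bytes] = set()
    cleaned = []
    for page in pages:
        sentences = [s.strip() for s in page.split(". ") if s.strip()]
        kept = []
        for i, sentence in enumerate(sentences):
            span = " ".join(sentences[i:i + 3]).lower()
            digest = hashlib.sha1(span.encode("utf-8")).digest()
            if digest in seen:
                continue  # sentence opens a span already emitted elsewhere
            seen.add(digest)
            kept.append(sentence)
        cleaned.append(". ".join(kept))
    return cleaned
```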
offensive content filtering via heuristic rules
Medium confidence. Filters offensive, inappropriate, or harmful content from C4 using keyword matching against a published blocklist (the 'List of Dirty, Naughty, Obscene or Otherwise Bad Words'), applied during dataset construction. This creates a cleaner training corpus less likely to produce offensive model outputs, though heuristic filtering is inherently imperfect and may miss context-dependent offensiveness or allow some harmful content through.
Uses deterministic heuristic rules (keyword matching, pattern-based filtering) to remove offensive content at scale, enabling reproducible and transparent filtering without learned classifiers; applied during dataset construction rather than at inference time
More transparent and reproducible than learned filtering approaches; simpler to implement and audit than neural classifiers; less sophisticated than context-aware filtering but faster and more deterministic
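A sketch of the page-level blocklist check; C4's actual list is the published 'List of Dirty, Naughty, Obscene or Otherwise Bad Words', and the two entries below are placeholders.

```python
BAD_WORDS = {"badword1", "badword2"}  # placeholders, not the real blocklist

def passes_blocklist(page_text: str) -> bool:
    """Drop the entire page if any blocklisted token appears. The real
    filter is equally blunt, which is why the Hub also hosts an
    'en.noblocklist' variant without this step."""
    tokens = set(page_text.lower().split())
    return tokens.isdisjoint(BAD_WORDS)
```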
short-document filtering with length-based heuristics
Medium confidence. Removes pages that contain too little text: the published C4 rules drop pages with fewer than three sentences and discard lines with fewer than five words, filtering out low-quality, stub, or boilerplate content. This filtering is applied during corpus curation and reduces the proportion of short, low-information-density documents in the training corpus. The approach is simple and transparent but may remove legitimate short-form content like abstracts, summaries, or social media posts.
Uses simple, transparent length-based filtering (a minimum of three sentences per page and five words per line) to remove low-quality stub content, making the filtering auditable and reproducible; most alternative corpora use more complex quality heuristics
Simpler and more transparent than learned quality classifiers, but less effective at identifying low-quality content that is not simply short
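The page-level check from the published rules, as a sketch; crude '.'-splitting stands in for real sentence segmentation.

```python
def passes_length_filter(page_text: str, min_sentences: int = 3) -> bool:
    """Keep a page only if it has at least three sentences, per the
    T5 paper; anything shorter is treated as a stub."""
    sentences = [s for s in page_text.split(".") if s.strip()]
    return len(sentences) >= min_sentences
```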
Capabilities are decomposed by AI analysis. Each maps to specific user intents and improves with match feedback.
Related Artifacts (sharing capabilities)
Artifacts that share capabilities with C4 (Colossal Clean Crawled Corpus), ranked by overlap. Discovered automatically through the match graph.
mC4
Multilingual web corpus covering 101 languages.
OPUS
Massive parallel corpus for machine translation.
CulturaX
6.3T token multilingual dataset across 167 languages.
RedPajama v2
30 trillion token web dataset with 40+ quality signals per document.
fineweb
Dataset by HuggingFaceFW. 643,166 downloads.
FineFineWeb
Dataset by m-a-p. 459,057 downloads.
Best For
- ✓Research teams training foundational LLMs and needing a reproducible baseline dataset
- ✓Organizations benchmarking model performance against T5-era standards
- ✓Researchers studying data quality and filtering effects on model behavior
- ✓Multilingual model developers needing balanced, cleaned data across many languages
- ✓Researchers studying cross-lingual transfer and language-specific biases
- ✓Teams building models for low-resource languages using high-resource language data
- ✓News organizations and media companies training domain-specific models
- ✓Researchers studying news bias, misinformation, and domain-specific language patterns
Known Limitations
- ⚠Heuristic-based filtering may miss nuanced offensive content or allow some low-quality text through
- ⚠750GB dataset size requires significant storage and bandwidth for download
- ⚠English-only variant excludes non-English speakers; multilingual variant adds complexity
- ⚠Sentence-level deduplication may not catch semantic duplicates or near-duplicates
- ⚠Dataset is static and not updated; it was built from a single Common Crawl snapshot (April 2019), so newer web content is not included
- ⚠Language detection errors may misclassify documents, especially for similar languages or code-mixed text
UnfragileRank
UnfragileRank is computed from adoption signals, documentation quality, ecosystem connectivity, match graph feedback, and freshness. No artifact can pay for a higher rank.
About
Google's cleaned version of Common Crawl used to train the original T5 model. 750GB of English text filtered with heuristic rules: removed short pages, deduped sentences, filtered offensive content, and restricted to English. Despite being superseded by newer datasets, C4 remains one of the most studied and benchmarked pre-training datasets. Available in English, multilingual (108 languages), and realnewslike variants on Hugging Face.
Categories
Alternatives to C4 (Colossal Clean Crawled Corpus)