mC4

Q: What can mC4 do?

multilingual text corpus extraction from web crawl, quality filtering and deduplication at scale, language-stratified dataset sampling and balancing, streaming access to petabyte-scale corpus without full download, language-specific metadata and statistics reporting, reproducible dataset versioning and snapshot management, language family and script-based document grouping

Dataset, Free

Multilingual web corpus covering 101 languages.

Open Source


7 capabilities

Capabilities (7 decomposed)

multilingual text corpus extraction from web crawl

Medium confidence

Extracts and processes raw text from Common Crawl's petabyte-scale web archive. Language identification covers 101 languages (the mT5 paper describes using the CLD3 classifier for this step) and assigns each document a language before quality filtering. The pipeline runs in a distributed fashion, identifying language at the document level and routing each document to its language-specific processing chain.
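The document-level routing described above can be sketched as follows. This is a minimal illustration, not the actual mC4 pipeline: the `identify` callable is a hypothetical interface standing in for a real language classifier, `toy_identify` is a deliberately crude stand-in used only for the demo, and the 0.7 confidence threshold is an illustrative default.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, Tuple

# Hypothetical classifier interface: text -> (language_code, confidence).
LangId = Callable[[str], Tuple[str, float]]


def route_by_language(
    docs: Iterable[str], identify: LangId, min_confidence: float = 0.7
) -> Dict[str, List[str]]:
    """Group documents by predicted language, dropping low-confidence ones."""
    buckets: Dict[str, List[str]] = defaultdict(list)
    for text in docs:
        lang, confidence = identify(text)
        if confidence >= min_confidence:
            buckets[lang].append(text)
    return dict(buckets)


def toy_identify(text: str) -> Tuple[str, float]:
    """Stand-in classifier for the demo only; not a real language model."""
    if any("\u0400" <= ch <= "\u04ff" for ch in text):  # Cyrillic block
        return "ru", 0.95
    if " the " in f" {text.lower()} ":
        return "en", 0.9
    return "und", 0.3  # undetermined, low confidence


docs = ["The cat sat on the mat.", "Привет, мир", "12345 ???"]
routed = route_by_language(docs, toy_identify)
```

Swapping `toy_identify` for a wrapper around a real classifier keeps the routing logic unchanged; each bucket can then be handed to its own filtering chain.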

Solves for

I need training data for multilingual NLP models that covers diverse languages beyond English

I want to build language-specific datasets from web-scale sources without manually curating content

I need to understand language distribution and coverage across 100+ languages in a single coherent dataset

Best for

researchers training multilingual foundation models (mT5, mBERT variants)

teams building language-specific NLP applications needing diverse training corpora

organizations studying cross-lingual transfer learning and zero-shot capabilities

Requires

Hugging Face Datasets library (datasets>=2.0.0)

Minimum 500GB disk space for full corpus; streaming mode available when local storage is limited

Python 3.7+

Limitations

Language identification has ~2-5% error rate on code-mixed documents and low-resource languages

No document-level metadata preservation (original URLs, timestamps, source domains removed for privacy)

Imbalanced language representation — high-resource languages (English, Spanish, French) dominate; Tier-3 languages have <1M documents

What makes it unique

Processes 101 languages from Common Crawl with a single language-identification step (CLD3, per the mT5 paper) rather than separate language-specific crawls or manual curation; achieves language separation without requiring language-specific preprocessing pipelines

vs alternatives

Covers 101 languages in a single dataset built by one pipeline, whereas its predecessor C4 is English-only and corpora such as OSCAR are built with a different pipeline and distributed per language

quality filtering and deduplication at scale

Medium confidence

Applies multi-stage filtering heuristics to remove low-quality documents: detects boilerplate and template content using n-gram overlap analysis, removes documents with excessive non-text characters or repetitive patterns, and performs fuzzy deduplication using MinHash signatures to identify near-duplicate documents across the corpus. Filtering operates in streaming mode to avoid materializing the entire dataset in memory.
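To make the MinHash step concrete, here is a small standard-library sketch of the general technique: word n-gram shingles, a fixed number of salted hash functions, and a signature comparison that estimates Jaccard similarity. It is an illustration of the idea rather than the pipeline's own code; the function names, the shingle size, and the 64 hash functions are all assumptions, and a production setup would more likely use the datasketch library listed under Requires.

```python
import hashlib
from typing import List, Set


def shingles(text: str, n: int = 3) -> Set[str]:
    """Word n-grams of the text (the whole text if it has fewer than n words)."""
    words = text.lower().split()
    if len(words) < n:
        return {" ".join(words)}
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def _hash(item: str, seed: int) -> int:
    """One member of a family of 64-bit hash functions, selected by seed."""
    salt = seed.to_bytes(8, "big")
    digest = hashlib.blake2b(item.encode("utf-8"), digest_size=8, salt=salt).digest()
    return int.from_bytes(digest, "big")


def minhash_signature(items: Set[str], num_perm: int = 64) -> List[int]:
    """For each hash function, keep the minimum hash value over all items."""
    return [min(_hash(item, seed) for item in items) for seed in range(num_perm)]


def estimated_jaccard(sig_a: List[int], sig_b: List[int]) -> float:
    """Fraction of matching signature slots approximates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


doc_a = "the quick brown fox jumps over the lazy dog near the river bank"
doc_b = "the quick brown fox jumps over the lazy dog near the river shore"
doc_c = "completely unrelated sentence about multilingual corpora and tokenizers"

sig_a, sig_b, sig_c = (minhash_signature(shingles(d)) for d in (doc_a, doc_b, doc_c))
sim_same = estimated_jaccard(sig_a, sig_a)
sim_near = estimated_jaccard(sig_a, sig_b)
sim_far = estimated_jaccard(sig_a, sig_c)
```

Documents whose estimated similarity exceeds a chosen threshold would be treated as near-duplicates. Because the estimate is probabilistic, both missed duplicates and false positives are possible, which matches the limitation noted for this capability.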

Solves for

I need to remove boilerplate, navigation text, and template content from web crawl data

I want to deduplicate near-identical documents that appear across multiple crawl snapshots

I need to ensure training data quality by filtering out low-signal documents programmatically

Best for

ML teams training language models where data quality directly impacts downstream performance

researchers studying the effect of corpus quality on model convergence and generalization

organizations with limited compute budgets who need to maximize signal-to-noise ratio in training data

Requires

Python 3.7+

datasketch library for MinHash implementation

Sufficient RAM for maintaining deduplication signatures (~50GB for full corpus)

Limitations

Deduplication uses approximate matching (MinHash) — some true duplicates may be missed; false positive rate ~1-3%

Boilerplate detection relies on heuristics (character ratios, repetition patterns) — may incorrectly filter legitimate repetitive content (e.g., poetry, code examples)

No semantic deduplication — documents with identical meaning but different wording are treated as unique

What makes it unique

Combines multi-stage filtering (boilerplate detection via n-gram analysis + MinHash deduplication) in a streaming pipeline that avoids materializing full corpus, enabling processing of petabyte-scale data without distributed compute clusters

vs alternatives

More aggressive quality filtering than raw Common Crawl but less aggressive than curated datasets like Wikipedia, trading some cleanliness for scale; this is the balance that was used for mT5 training

language-stratified dataset sampling and balancing

Medium confidence

Provides mechanisms to sample documents proportionally or uniformly across 101 languages, enabling researchers to create balanced training splits or language-specific subsets. Sampling operates at the dataset configuration level using Hugging Face Datasets' split API, allowing dynamic creation of language-balanced or language-stratified subsets without re-downloading the full corpus.
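The difference between proportional and uniform sampling can be expressed with a single exponent. The sketch below computes per-language sampling probabilities as p(L) ∝ |L|^α, the smoothing scheme described in the mT5 paper (which reports using α = 0.3): α = 1 samples proportionally to corpus size and α = 0 samples uniformly. The document counts are made-up numbers for illustration, and `sampling_probs` is a hypothetical helper, not part of any library.

```python
from typing import Dict


def sampling_probs(doc_counts: Dict[str, int], alpha: float) -> Dict[str, float]:
    """Per-language sampling probability proportional to count ** alpha."""
    weights = {lang: count ** alpha for lang, count in doc_counts.items()}
    total = sum(weights.values())
    return {lang: weight / total for lang, weight in weights.items()}


# Made-up document counts for one high-, one mid-, and one low-resource language.
counts = {"en": 1_000_000, "sw": 10_000, "yo": 100}

proportional = sampling_probs(counts, alpha=1.0)  # follows corpus size
uniform = sampling_probs(counts, alpha=0.0)       # equal share per language
smoothed = sampling_probs(counts, alpha=0.3)      # in between
```

With Hugging Face Datasets, probabilities like these could be passed to `interleave_datasets(..., probabilities=..., seed=...)` over per-language streams; fixing the seed also addresses the reproducibility concern listed under Limitations.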

Solves for

I want to create a balanced training set where each language has equal representation

I need to extract data for a specific language or language family for targeted model training

I want to study how language imbalance affects multilingual model performance

Best for

researchers training multilingual models who want to control language representation in training data

teams building language-specific models who need to isolate single-language subsets efficiently

organizations studying fairness in multilingual NLP and language representation bias

Requires

Hugging Face Datasets library (datasets>=2.0.0)

Python 3.7+

Knowledge of language codes (ISO 639-1 or custom language identifiers used in mC4)

Limitations

Sampling is performed at dataset load time — no persistent sampling configuration; re-running code may yield different random samples

No stratified sampling by other dimensions (domain, quality score, document length) — only language-level stratification

Balancing to equal language representation requires discarding 99%+ of high-resource language data, significantly reducing dataset size

What makes it unique

Integrates language-stratified sampling directly into Hugging Face Datasets' split configuration, enabling dynamic creation of balanced subsets without materializing intermediate datasets or requiring custom sampling scripts

vs alternatives

Provides built-in language-aware sampling vs. generic datasets that require manual filtering; more flexible than fixed pre-split versions because sampling parameters can be adjusted at load time

streaming access to petabyte-scale corpus without full download

Medium confidence

Implements streaming mode via Hugging Face Datasets' streaming API, allowing researchers to iterate over documents sequentially without downloading the entire corpus to disk. Data is fetched on-demand from cloud storage (Hugging Face Hub), with optional local caching of accessed documents. Streaming uses HTTP range requests to fetch only required data chunks, enabling memory-efficient processing on machines with limited storage.
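A hedged sketch of what streaming access might look like in practice. `load_dataset(..., streaming=True)` is the Hugging Face Datasets call for lazy iteration; the default `path` and `config` values below are assumptions about where a copy of mC4 is hosted and should be replaced with the identifier you actually use. `stream_documents` needs the `datasets` package and network access when called, so the runnable part of the example uses a stand-in stream instead.

```python
from itertools import islice
from typing import Any, Dict, Iterable, Iterator, List


def stream_documents(path: str = "allenai/c4", config: str = "fr") -> Iterator[Dict[str, Any]]:
    """Lazily iterate over one language config without downloading it first.

    Assumption: `path` and `config` name a Hub dataset that exposes mC4;
    adjust both to the identifier your copy of the corpus uses.
    """
    from datasets import load_dataset  # imported lazily so take_first works without it

    return iter(load_dataset(path, config, split="train", streaming=True))


def take_first(records: Iterable[Any], n: int) -> List[Any]:
    """Pull at most n records from any iterable, stopping the stream early."""
    return list(islice(records, n))


# Offline demonstration with a stand-in stream of records.
fake_stream = ({"text": f"document {i}"} for i in range(1000))
preview = take_first(fake_stream, 3)
```

Calling `take_first(stream_documents(), 5)` would fetch only as much data as is needed to yield five documents, which is the prototyping workflow this capability targets.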

Solves for

I want to experiment with mC4 without committing 500GB+ of disk space

I need to process the corpus in a streaming fashion for online learning or incremental model training

I want to quickly prototype with a small sample before committing to full dataset download

Best for

researchers with limited local storage exploring dataset characteristics

teams using cloud compute (Colab, Lambda Labs) where persistent storage is expensive

organizations implementing streaming training pipelines that process data once per epoch

Requires

Stable internet connection (minimum 10 Mbps recommended for practical iteration speed)

Hugging Face Datasets library with streaming support (datasets>=2.4.0)

Python 3.7+

Limitations

Streaming mode is 5-10x slower than local disk access due to network latency and HTTP overhead

No random access — documents must be iterated sequentially; shuffling requires buffering large windows in memory

Network interruptions during streaming can corrupt the iteration; requires manual retry logic
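The shuffling limitation above comes from how approximate shuffling over a sequential stream works: a fixed-size buffer is filled, and each incoming item evicts a randomly chosen buffered one. The sketch below is a generic illustration of that technique with hypothetical names; Hugging Face Datasets offers a comparable `shuffle(seed=..., buffer_size=...)` method on streaming datasets.

```python
import random
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def shuffle_buffer(stream: Iterable[T], buffer_size: int, seed: int) -> Iterator[T]:
    """Approximately shuffle a stream while holding only buffer_size items."""
    rng = random.Random(seed)
    buffer: List[T] = []
    for item in stream:
        if len(buffer) < buffer_size:
            buffer.append(item)  # still filling the buffer
            continue
        idx = rng.randrange(buffer_size)
        yield buffer[idx]        # emit a random buffered item
        buffer[idx] = item       # and replace it with the new one
    rng.shuffle(buffer)          # flush what is left at end of stream
    yield from buffer


shuffled = list(shuffle_buffer(range(100), buffer_size=10, seed=0))
shuffled_again = list(shuffle_buffer(range(100), buffer_size=10, seed=0))
```

Memory use grows with `buffer_size`, and so does shuffle quality: a small buffer leaves items close to their original position, which is why good shuffling of a sequential stream needs a large window.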

What makes it unique

Leverages Hugging Face Hub's HTTP range request infrastructure to enable true streaming without requiring distributed file systems (HDFS, S3) or local mirroring, making petabyte-scale data accessible from consumer hardware

vs alternatives

Enables streaming access without AWS S3 credentials or Spark clusters, unlike raw Common Crawl access; more practical for individual researchers than downloading full corpus

language-specific metadata and statistics reporting

Medium confidence

Provides aggregated statistics per language including document counts, token counts, character distributions, and quality metrics (deduplication rate, boilerplate removal rate). Statistics are computed during dataset creation and exposed via Hugging Face Datasets' info API, enabling researchers to understand language coverage and data characteristics without processing the full corpus.
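As a rough picture of how such per-language aggregates could be computed, the sketch below counts documents, whitespace-separated tokens, and characters per language in a single pass. It is an assumption-laden illustration: `language_stats` is a hypothetical helper, the `(language, text)` record shape is chosen for the demo, and whitespace tokenization is used only because the listing's limitations describe token counts that way.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def language_stats(records: Iterable[Tuple[str, str]]) -> Dict[str, Dict[str, int]]:
    """Aggregate document, token, and character counts per language code."""
    docs: Counter = Counter()
    tokens: Counter = Counter()
    chars: Counter = Counter()
    for lang, text in records:
        docs[lang] += 1
        tokens[lang] += len(text.split())  # whitespace tokens, not model tokens
        chars[lang] += len(text)
    return {
        lang: {"documents": docs[lang], "tokens": tokens[lang], "characters": chars[lang]}
        for lang in docs
    }


sample = [
    ("en", "hello world"),
    ("en", "one two three"),
    ("fr", "bonjour le monde"),
]
stats = language_stats(sample)
```

Because the function consumes any iterable, it can run over a streamed subset to get quick estimates before committing to a full pass.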

Solves for

I need to understand language distribution and coverage in mC4 before deciding which languages to include in training

I want to quantify data imbalance across languages and plan sampling strategies accordingly

I need to report dataset composition and language coverage in research papers or technical documentation

Best for

researchers writing papers on multilingual model training who need to document dataset composition

teams planning multilingual model training who need to understand language-specific data availability

organizations auditing training data for language representation and fairness

Requires

Hugging Face Datasets library (datasets>=2.0.0)

Python 3.7+

Network access to Hugging Face Hub to fetch dataset info

Limitations

Statistics are pre-computed and static — no real-time updates as new Common Crawl snapshots are released

No per-domain or per-source statistics — only language-level aggregation

Token counts are approximate (based on whitespace tokenization) — not aligned with actual model tokenizers

What makes it unique

Embeds language-stratified statistics directly in Hugging Face Datasets' metadata layer, making coverage and composition queryable without downloading data; statistics are versioned alongside dataset releases

vs alternatives

Provides transparent language coverage statistics vs. competitors like OSCAR which publish aggregate stats separately; enables programmatic access to statistics for automated dataset selection

reproducible dataset versioning and snapshot management

Medium confidence

Maintains versioned snapshots of the mC4 corpus corresponding to specific Common Crawl releases (e.g., 2019-04, 2020-05), enabling researchers to reproduce experiments across time. Versioning is managed through Hugging Face Datasets' revision system, allowing specification of exact dataset versions in code. Each version is immutable and includes metadata about the source Common Crawl snapshot and processing pipeline version.
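One way to keep a pinned version explicit in code is to build the load arguments in one place and record them alongside experiment results. The sketch below assumes the `path`, `name`, `split`, and `revision` parameters of Hugging Face Datasets' `load_dataset`; the helper names are hypothetical, and the revision string is a placeholder rather than a real mC4 version identifier.

```python
from typing import Any, Dict, Optional


def pinned_load_kwargs(
    path: str, config: Optional[str], revision: str, split: str = "train"
) -> Dict[str, Any]:
    """Collect the arguments that fully determine which data gets loaded."""
    kwargs: Dict[str, Any] = {"path": path, "revision": revision, "split": split}
    if config is not None:
        kwargs["name"] = config  # `name` is load_dataset's config parameter
    return kwargs


def load_pinned(**kwargs: Any):
    """Load the dataset with the pinned arguments (needs `datasets` and network)."""
    from datasets import load_dataset  # imported lazily; not needed to build kwargs

    return load_dataset(**kwargs)


# Placeholder revision: substitute a real commit hash or tag from the Hub repo.
kwargs = pinned_load_kwargs("mc4", "fr", revision="<commit-hash-or-tag>")
```

Logging `kwargs` with each run gives a simple data-lineage record; `load_pinned(**kwargs)` then reproduces the same load later.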

Solves for

I need to reproduce a published paper's results using the exact same training data version

I want to track how model performance changes across different mC4 snapshots

I need to ensure my experiments are reproducible by pinning dataset versions in my code

Best for

researchers publishing papers who need to ensure reproducibility and enable others to replicate results

teams comparing model performance across time who need to isolate dataset changes from model changes

organizations with long-running training pipelines who need to track data lineage

Requires

Hugging Face Datasets library (datasets>=2.0.0)

Python 3.7+

Knowledge of mC4 version identifiers (e.g., 'main', 'en', specific Common Crawl snapshot IDs)

Limitations

Versioning is tied to Common Crawl release schedule — new versions only available when Common Crawl publishes new snapshots (~monthly)

No ability to create custom versions or branches — only official releases are versioned

Version metadata is minimal — no detailed changelog of filtering/deduplication changes between versions

What makes it unique

Integrates dataset versioning with Hugging Face Hub's Git-like revision system, enabling researchers to specify exact dataset versions in code (e.g., `load_dataset('mc4', revision='2020-05')`) for reproducible experiments

vs alternatives

Provides explicit version pinning vs. raw Common Crawl which requires manual snapshot management; more reproducible than competitors who don't version their processed datasets

language family and script-based document grouping

Medium confidence

Enables filtering and grouping of documents by linguistic properties beyond language code: supports queries by language family (e.g., 'Indo-European', 'Sino-Tibetan'), writing system (e.g., 'Latin', 'Arabic', 'CJK'), or linguistic features (e.g., 'low-resource', 'endangered'). Grouping is implemented via metadata tags assigned during language identification, allowing efficient subset creation for cross-lingual or script-aware research.
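Grouping by family or script amounts to a lookup from language code to linguistic metadata followed by a filter. The sketch below uses a tiny hand-written taxonomy purely for illustration; the table, the `LangInfo` type, and both helper functions are hypothetical and do not reflect how tags are stored in the dataset.

```python
from typing import Dict, Iterable, List, NamedTuple, Optional


class LangInfo(NamedTuple):
    family: str
    script: str


# Hypothetical, hand-written taxonomy covering a handful of language codes.
TAXONOMY: Dict[str, LangInfo] = {
    "es": LangInfo("Indo-European", "Latin"),
    "fr": LangInfo("Indo-European", "Latin"),
    "hi": LangInfo("Indo-European", "Devanagari"),
    "ar": LangInfo("Afro-Asiatic", "Arabic"),
    "zh": LangInfo("Sino-Tibetan", "Han"),
}


def languages_in(
    taxonomy: Dict[str, LangInfo],
    family: Optional[str] = None,
    script: Optional[str] = None,
) -> List[str]:
    """Language codes matching the given family and/or script."""
    return sorted(
        code
        for code, info in taxonomy.items()
        if (family is None or info.family == family)
        and (script is None or info.script == script)
    )


def filter_docs(docs: Iterable[dict], codes: Iterable[str]) -> List[dict]:
    """Keep documents whose 'language' field is one of the given codes."""
    wanted = set(codes)
    return [doc for doc in docs if doc["language"] in wanted]


indo_european = languages_in(TAXONOMY, family="Indo-European")
latin_indo_european = languages_in(TAXONOMY, family="Indo-European", script="Latin")
docs = [{"language": "fr", "text": "bonjour"}, {"language": "zh", "text": "你好"}]
romance_subset = filter_docs(docs, latin_indo_european)
```

Keeping the taxonomy as plain data makes it easy to swap in a research-specific grouping, which the static assignments described under Limitations do not allow.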

Solves for

I want to study cross-lingual transfer within a language family (e.g., Romance languages)

I need to analyze how writing system affects model performance on multilingual tasks

I want to focus on low-resource languages for zero-shot or few-shot learning research

Best for

linguists and NLP researchers studying cross-lingual phenomena and language families

teams building script-aware models (e.g., multilingual OCR, script identification)

organizations researching low-resource language NLP and transfer learning

Requires

Hugging Face Datasets library (datasets>=2.0.0)

Python 3.7+

Knowledge of language family and script taxonomy used in mC4

Limitations

Language family and script assignments are static — no ability to customize groupings for research-specific taxonomies

Linguistic feature tags (e.g., 'low-resource') are binary — no continuous metrics like speaker population or linguistic diversity

No support for dialect or regional variant grouping — only language-level classification

What makes it unique

Augments language-level filtering with linguistic metadata (family, script, resource level) computed during language identification, enabling cross-lingual research without requiring external linguistic databases

vs alternatives

Provides built-in language family grouping vs. competitors requiring manual mapping of language codes to families; enables script-aware filtering not available in generic multilingual datasets

Capabilities are decomposed by AI analysis. Each maps to specific user intents and improves with match feedback.

Related Artifacts sharing capabilities

Artifacts that share capabilities with mC4, ranked by overlap. Discovered automatically through the match graph.

Dataset 26

fineweb

Dataset by HuggingFaceFW. 637,939 downloads.

large-scale web text corpus curation and filtering, language detection and english-only filtering, domain-stratified text sampling and split management

3 shared capabilities

Dataset 45

CulturaX

6.3T token multilingual dataset across 167 languages.

multilingual-text-deduplication-at-scale, quality-filtering-with-language-specific-heuristics, language-specific-dataset-slicing-and-sampling

3 shared capabilities

Dataset 26

c4

Dataset by allenai. 698,456 downloads.

multilingual web-scale text corpus ingestion and deduplication, language detection and multilingual corpus stratification

2 shared capabilities

Dataset 46

RedPajama v2

30 trillion token web dataset with 40+ quality signals per document.

multilingual web-scale pretraining corpus provision, language-specific corpus extraction and analysis

2 shared capabilities

Dataset 46

C4 (Colossal Clean Crawled Corpus)

Google's cleaned Common Crawl corpus used to train T5.

large-scale english text corpus filtering and deduplication, multi-language text corpus with 108-language support

2 shared capabilities

Dataset 46

FineWeb

Hugging Face's 15T token dataset, new standard for LLM training.

language detection and english isolation, multi-stage web data filtering pipeline

2 shared capabilities

Best For

✓researchers training multilingual foundation models (mT5, mBERT variants)
✓teams building language-specific NLP applications needing diverse training corpora
✓organizations studying cross-lingual transfer learning and zero-shot capabilities
✓ML teams training language models where data quality directly impacts downstream performance
✓researchers studying the effect of corpus quality on model convergence and generalization
✓organizations with limited compute budgets who need to maximize signal-to-noise ratio in training data
✓researchers training multilingual models who want to control language representation in training data
✓teams building language-specific models who need to isolate single-language subsets efficiently

Known Limitations

⚠Language identification has ~2-5% error rate on code-mixed documents and low-resource languages
⚠No document-level metadata preservation (original URLs, timestamps, source domains removed for privacy)
⚠Imbalanced language representation — high-resource languages (English, Spanish, French) dominate; Tier-3 languages have <1M documents
⚠Snapshot dataset — no continuous updates; requires re-processing entire Common Crawl for newer content
⚠Deduplication uses approximate matching (MinHash) — some true duplicates may be missed; false positive rate ~1-3%
⚠Boilerplate detection relies on heuristics (character ratios, repetition patterns) — may incorrectly filter legitimate repetitive content (e.g., poetry, code examples)

Requirements

Hugging Face Datasets library (datasets>=2.0.0)
Minimum 500GB disk space for full corpus; streaming mode available when local storage is limited
Python 3.7+
Internet connection for downloading from Hugging Face Hub
datasketch library for MinHash implementation
Sufficient RAM for maintaining deduplication signatures (~50GB for full corpus)
Knowledge of language codes (ISO 639-1 or custom language identifiers used in mC4)
Stable internet connection (minimum 10 Mbps recommended for practical iteration speed)

Input / Output

Accepts: Common Crawl WET/WARC files (raw web crawl archives), Language identification model (CLD3, per the mT5 paper), Raw text documents from language-identified corpus, Document metadata (language, source domain), Language identifier for filtering (e.g., 'en', 'fr', 'zh'), Sampling parameters (sample size, random seed), Dataset configuration (language, split), Streaming parameters (buffer size, cache location), Dataset identifier ('mc4'), Language code (optional, for language-specific stats), Dataset identifier and revision/version string, Language and split configuration, Language family identifier (e.g., 'Indo-European'), Writing system code (e.g., 'Latin', 'Arabic'), Linguistic feature tag (e.g., 'low-resource')

Produces: Parquet files (columnar format with text, language, and metadata fields), Hugging Face Dataset objects (streaming or cached), JSON Lines format for downstream processing, Filtered document corpus (same format as input), Deduplication statistics and quality metrics, Removed documents log (optional, for analysis), Hugging Face Dataset object with filtered/sampled documents, Language distribution statistics (document counts per language), Iterator over document batches, Individual documents as dictionaries with 'text' and 'language' fields, JSON-formatted statistics (document counts, token counts, language distribution), Hugging Face DatasetInfo object with metadata, Summary tables suitable for research papers, Hugging Face Dataset object pinned to specific version, Version metadata (Common Crawl snapshot date, processing pipeline version), Filtered dataset containing documents matching linguistic criteria, List of languages in specified family/script, Language family and script distribution statistics

UnfragileRank

Adoption: 70% (35% weight)

Quality: 23% (25% weight)

Ecosystem: 40% (20% weight)

Match Graph: 10% (15% weight)

Freshness: 100% (5% weight)

UnfragileRank is computed from adoption signals, documentation quality, ecosystem connectivity, match graph feedback, and freshness. No artifact can pay for a higher rank.

Type: Dataset


About

Multilingual Colossal Clean Crawled Corpus covering 101 languages extracted from Common Crawl with language identification and quality filtering, providing the training data for mT5 and multilingual model research.

Alternatives to mC4

cua 53 Agent

Open-source infrastructure for Computer-Use Agents. Sandboxes, SDKs, and benchmarks to train and evaluate AI agents that can control full desktops (macOS, Linux, Windows).

Hugging Face 43 Platform

The GitHub for AI: 500K+ models, datasets, Spaces, Inference API, hub for open-source AI.

Stable-Diffusion 55 Repository

FLUX, Stable Diffusion, SDXL, SD3, LoRA, Fine Tuning, DreamBooth, Training, Automatic1111, Forge WebUI, SwarmUI, DeepFake, TTS, Animation, Text To Video, Tutorials, Guides, Lectures, Courses, ComfyUI, Google Colab, RunPod, Kaggle, NoteBooks, ControlNet, TTS, Voice Cloning, AI, AI News, ML, ML News

YOLOv8 46 Model

Real-time object detection, segmentation, and pose.
