ROOTS

BigScience's curated multilingual dataset for BLOOM.

Open Source

8 capabilities

Best for: multilingual pretraining corpus assembly with explicit language coverage, language-specific subset filtering and selective loading, source provenance and licensing metadata retrieval
Type: Dataset · Free
Score: 57/100
Best alternative: Hugging Face MCP Server

Capabilities (8 decomposed)

multilingual pretraining corpus assembly with explicit language coverage

Medium confidence

ROOTS provides a curated collection of 46 natural languages and 13 programming languages organized into discrete, versioned subsets with documented sourcing and licensing metadata. The dataset uses a modular architecture in which each language community contributed curation decisions, so downstream models like BLOOM can train on broad multilingual data without requiring custom data collection pipelines. Data is indexed by language code and accessible via the Hugging Face Datasets API, with streaming support for large-scale distributed training.
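A minimal sketch of streaming one subset by language code may make the access pattern concrete. It assumes the `datasets` library is installed and that subsets live in Hub repositories named like `bigscience-data/roots_<language>_<source>`; that naming pattern and both helper names are assumptions to check against the Hub, not something this listing confirms.

```python
from typing import Iterator


def roots_repo_id(language: str, source: str, org: str = "bigscience-data") -> str:
    """Build a Hub repository id for one subset (naming pattern is an assumption)."""
    return f"{org}/roots_{language}_{source}"


def stream_subset(repo_id: str, split: str = "train") -> Iterator[dict]:
    """Lazily yield records from one subset without downloading it in full."""
    # Imported here so roots_repo_id stays usable without the library installed.
    from datasets import load_dataset

    # streaming=True returns an iterable dataset that fetches data on demand.
    # Restricted repositories additionally need an authenticated Hub session.
    yield from load_dataset(repo_id, split=split, streaming=True)
```

Iterating over `stream_subset(roots_repo_id("en", "wikipedia"))` would then pull records one at a time, provided the repository exists under that name and access has been granted.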

Solves for

Train a multilingual language model without building a custom data pipeline from scratch

Understand the exact composition and provenance of training data used in BLOOM

Replicate or extend multilingual pretraining with transparent data governance

Access balanced language representation across 46+ languages for fair model evaluation

Best for

ML researchers training multilingual foundation models

Teams reproducing BLOOM or building variants with similar language coverage

Organizations requiring transparent, documented training data for compliance

Requires

Hugging Face Datasets library (datasets>=2.0.0)

Python 3.7+

~1.6TB disk space for full local copy or streaming internet connection

Limitations

Dataset is fixed and immutable — no ability to add new languages or reweight existing ones post-publication

Streaming from Hugging Face requires internet connectivity; full download is ~1.6TB uncompressed

Language representation is not perfectly balanced — some languages have significantly more data than others due to source availability
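The imbalance noted in the limitations above is often softened at sampling time rather than in the data itself. The sketch below assumes exponent-smoothed sampling, a common technique in multilingual training that the ROOTS documentation does not prescribe; the function name, the `alpha` value, and the subset sizes are all hypothetical.

```python
from typing import Dict


def smoothed_sampling_weights(sizes: Dict[str, float], alpha: float = 0.3) -> Dict[str, float]:
    """Return sampling probabilities proportional to size ** alpha.

    alpha = 1.0 keeps the natural proportions, alpha = 0.0 samples every
    language uniformly, and values in between upweight smaller languages.
    """
    if not sizes:
        raise ValueError("sizes must not be empty")
    scaled = {lang: size ** alpha for lang, size in sizes.items()}
    total = sum(scaled.values())
    return {lang: value / total for lang, value in scaled.items()}


# Hypothetical subset sizes in GB, for illustration only (not ROOTS figures).
example_sizes = {"en": 400.0, "fr": 200.0, "sw": 1.0}
weights = smoothed_sampling_weights(example_sizes, alpha=0.5)
```

With `alpha=0.5` the smallest language receives a noticeably larger share than its raw proportion, at the cost of repeating its documents more often.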

What makes it unique

ROOTS implements community-driven data governance through explicit BigScience working groups per language, with published sourcing documents and licensing matrices that map each data subset to its original source and legal terms — a level of transparency rarely matched by proprietary training datasets. The dataset is versioned and immutable, enabling reproducible research and audit trails.

vs alternatives

Unlike Common Crawl or Wikipedia-only approaches, ROOTS provides curated, language-specific subsets with documented provenance and explicit governance decisions, making it suitable for research requiring transparent data sourcing and fair multilingual representation.

language-specific subset filtering and selective loading

Medium confidence

ROOTS enables fine-grained selection of training data by language, programming language, or source category through the Hugging Face Datasets API's filtering and split mechanisms. Users can load only subsets relevant to their task (e.g., only English + French, or only code data) without downloading the full corpus, reducing storage and compute overhead. The dataset structure uses language codes as primary keys, allowing efficient subset materialization during training pipeline initialization.
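As a rough illustration of selecting subsets before anything is loaded, the sketch below filters a list of subset identifiers by language code. The identifiers follow the examples given elsewhere on this page ('en', 'fr', 'code_python'); the catalog contents and the helper name are hypothetical.

```python
from typing import Iterable, List, Set

CODE_PREFIX = "code_"  # prefix seen in identifiers such as 'code_python'


def select_subsets(available: Iterable[str], languages: Set[str],
                   include_code: bool = False) -> List[str]:
    """Keep requested natural-language subsets and, optionally, all code subsets."""
    selected = []
    for name in available:
        is_code = name.startswith(CODE_PREFIX)
        if is_code and include_code:
            selected.append(name)
        elif not is_code and name in languages:
            selected.append(name)
    return selected


# Hypothetical catalog of subset identifiers.
catalog = ["en", "fr", "zh", "sw", "code_python", "code_java"]
bilingual = select_subsets(catalog, {"en", "fr"})
code_only = select_subsets(catalog, set(), include_code=True)
```

Only the names in `bilingual` or `code_only` would then be passed on to the loader, so the other subsets are never requested.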

Solves for

Train a bilingual model on only English and French without downloading 40+ other languages

Build a code-focused model using only the 13 programming language subsets

Evaluate model performance on specific language groups (e.g., low-resource languages) in isolation

Reduce training data size and iteration time by selecting relevant language subsets

Best for

Teams with limited storage or compute budgets targeting specific languages

Researchers studying language-specific model behavior or bias

Production teams fine-tuning models for specific language pairs

Requires

Hugging Face Datasets library with split/subset support

Knowledge of language codes used in ROOTS (ISO 639-1/3 or custom codes)

Python 3.7+

Limitations

Subset selection is static at load time — cannot dynamically reweight languages during training without reloading

No built-in cross-lingual deduplication — same content may appear in multiple language subsets if sourced from multilingual documents

Filtering is coarse-grained (by language code) — no finer filtering by domain, quality score, or document length within a language

What makes it unique

ROOTS organizes data with language as the primary partitioning key, enabling zero-copy subset selection at the Datasets API level — users can load only relevant languages without materializing the full corpus, a design choice that reduces memory overhead compared to post-hoc filtering on monolithic datasets.

vs alternatives

Compared to monolithic pretraining datasets like C4, ROOTS's language-partitioned structure allows selective loading without downloading irrelevant data, reducing iteration time and storage costs for multilingual or language-specific training.

source provenance and licensing metadata retrieval

Medium confidence

ROOTS includes structured metadata for each data subset documenting original source (e.g., Wikipedia, GitHub, web crawls), license type (CC-BY, MIT, public domain), and curation decisions made by BigScience working groups. This metadata is accessible via dataset cards and supplementary documentation files, enabling users to audit data lineage, verify legal compliance, and understand potential biases introduced by source selection. The metadata structure maps each language subset to its upstream sources with explicit attribution.
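Because the metadata maps subsets to sources and licenses, a compliance check can be approximated by filtering a licensing matrix. This is a sketch under stated assumptions: the CSV layout, column names, and rows below are invented for illustration and do not reproduce the actual ROOTS documentation files.

```python
import csv
import io
from typing import Dict, List, Set

# Hypothetical licensing matrix; columns and rows are illustrative only.
LICENSING_MATRIX_CSV = """subset,source,license
en,wikipedia,CC-BY
fr,web_crawl,unknown
code_python,github,MIT
"""


def subsets_with_allowed_licenses(matrix_csv: str, allowed: Set[str]) -> List[Dict[str, str]]:
    """Return the matrix rows whose license appears in the allowed set."""
    reader = csv.DictReader(io.StringIO(matrix_csv))
    return [row for row in reader if row["license"] in allowed]


approved = subsets_with_allowed_licenses(LICENSING_MATRIX_CSV, {"CC-BY", "MIT"})
approved_names = [row["subset"] for row in approved]
```

Rows with an unlisted or unknown license fall out of the result, which gives a reviewer a short list to inspect by hand.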

Solves for

Verify that training data complies with organizational or regulatory data sourcing policies

Understand which sources contributed to a specific language subset for bias analysis

Attribute data sources in model documentation and research papers

Identify and exclude data from specific sources if needed (e.g., due to license conflicts)

Best for

Compliance and legal teams validating training data for regulatory requirements

Researchers studying dataset bias and source effects on model behavior

Organizations publishing models and needing transparent attribution

Requires

Access to ROOTS dataset card on Hugging Face Hub

BigScience documentation files (available in the repository)

Manual review capability for legal/compliance teams

Limitations

Metadata is descriptive but not machine-queryable at scale — requires manual inspection of documentation files

Source attribution is at the subset level, not per-document — cannot trace individual records to original sources

License information is provided but enforcement is the user's responsibility — ROOTS does not prevent use of data in violation of stated licenses

What makes it unique

ROOTS publishes explicit sourcing documents and licensing matrices for each language subset, created through community-driven BigScience working groups — a governance model that makes data provenance a first-class artifact rather than an afterthought, enabling reproducible audits of training data composition.

vs alternatives

Unlike proprietary datasets or web crawls with opaque sourcing, ROOTS provides documented source attribution and licensing for each subset, enabling compliance verification and bias analysis that would be impossible with undocumented data.

distributed streaming access for large-scale training pipelines

Medium confidence

ROOTS integrates with Hugging Face Datasets' streaming API, enabling distributed training systems to fetch data on-the-fly without materializing the full corpus locally. The dataset is partitioned by language, allowing multiple training nodes to load different language subsets in parallel via HTTP range requests. This architecture supports efficient distributed training on clusters with limited aggregate storage, as each node streams only its assigned language subset during training iterations.
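One way to picture the per-node assignment described above is a deterministic round-robin split of subset names across ranks. This is a minimal sketch, not a mechanism provided by ROOTS or the Datasets library; the function name and the example subsets are hypothetical.

```python
from typing import Dict, List, Sequence


def assign_subsets_to_nodes(subsets: Sequence[str], world_size: int) -> Dict[int, List[str]]:
    """Distribute subset names across node ranks in round-robin order."""
    if world_size < 1:
        raise ValueError("world_size must be at least 1")
    assignment: Dict[int, List[str]] = {rank: [] for rank in range(world_size)}
    # Sort first so every node computes the same plan independently.
    for index, name in enumerate(sorted(subsets)):
        assignment[index % world_size].append(name)
    return assignment


plan = assign_subsets_to_nodes(["en", "fr", "zh", "ar", "code_python"], world_size=2)
```

Each rank would then stream only the names in `plan[rank]`. Round-robin ignores subset size, so a real pipeline would likely balance by document or byte count instead.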

Solves for

Train on ROOTS using distributed training frameworks (PyTorch DDP, DeepSpeed) without requiring shared storage

Reduce per-node storage requirements by streaming data on-demand during training

Enable rapid iteration on model architecture without waiting for full dataset downloads

Scale training across multiple nodes with independent data loading pipelines

Best for

Teams with distributed training infrastructure (multi-GPU clusters, cloud training)

Organizations with limited per-node storage but high network bandwidth

Research groups iterating rapidly on model architectures and needing fast data access

Requires

Hugging Face Datasets library with streaming support (datasets>=2.0.0)

Python 3.7+

Network connectivity to Hugging Face Hub (or self-hosted mirror)

Limitations

Streaming introduces network latency (~10-50ms per batch fetch) compared to local disk I/O

Requires stable, high-bandwidth internet connection — not suitable for offline training or unreliable networks

Streaming performance degrades if multiple nodes fetch the same subset simultaneously — no built-in caching or CDN

What makes it unique

ROOTS's language-partitioned structure enables efficient distributed streaming where each training node can independently fetch its assigned language subset via HTTP range requests, avoiding the need for shared storage or centralized data servers — a design that scales to large clusters without storage bottlenecks.

vs alternatives

Compared to datasets requiring full local copies (e.g., pre-downloaded tarballs), ROOTS streaming reduces storage overhead and enables rapid scaling across distributed clusters, though at the cost of network latency.

programming language code corpus with language-specific organization

Medium confidence

ROOTS includes 13 programming language subsets (Python, Java, C++, JavaScript, etc.) organized as separate, versioned datasets within the larger corpus. Each programming language subset is curated from sources like GitHub and Stack Overflow, with language-specific metadata (e.g., license type, repository stars). The code data is structured as raw source files with minimal preprocessing, enabling downstream models to learn language-specific syntax and idioms without artificial normalization.

Solves for

Train code generation or code understanding models on diverse programming languages

Build language-specific models (e.g., Python-only) by selecting relevant code subsets

Evaluate model performance on code tasks across multiple languages

Understand the composition and quality of code data used in BLOOM's training

Best for

ML researchers building code-focused language models or code completion tools

Teams training models for specific programming languages

Organizations studying code model bias and performance across languages

Requires

Hugging Face Datasets library

Python 3.7+

Knowledge of programming language codes used in ROOTS

Limitations

Code data is raw source without semantic parsing — no AST-level structure or code quality filtering

License information for code is less granular than natural language subsets — some code may have unclear licensing

No deduplication of code snippets — identical functions may appear multiple times across repositories
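Where duplicated code is a concern, exact duplicates can be dropped downstream with a content hash. The sketch below is one possible workaround, not part of ROOTS; it only catches byte-identical files after light normalization, and the function name and sample strings are hypothetical.

```python
import hashlib
from typing import Iterable, Iterator


def drop_exact_duplicates(files: Iterable[str]) -> Iterator[str]:
    """Yield each distinct source text once, keeping first occurrences."""
    seen = set()
    for text in files:
        # Normalize line endings and outer whitespace so trivial variants collide.
        normalized = text.replace("\r\n", "\n").strip()
        digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            yield text


samples = [
    "def f():\n    return 1\n",
    "def f():\r\n    return 1\r\n",  # same file with Windows line endings
    "def g():\n    return 2\n",
]
unique = list(drop_exact_duplicates(samples))
```

Near-duplicates, such as the same function with renamed variables, pass through unchanged and would need fuzzy matching.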

What makes it unique

ROOTS organizes code data by programming language as first-class subsets (13 languages), enabling language-specific model training and evaluation — a design choice that treats code as a distinct modality from natural language rather than mixing them in a monolithic corpus.

vs alternatives

Unlike code datasets that mix multiple languages or apply heavy preprocessing, ROOTS provides raw, language-partitioned code subsets with explicit sourcing, enabling researchers to study language-specific code model behavior and build specialized models.

community-driven data curation and governance documentation

Medium confidence

ROOTS was assembled through BigScience working groups organized by language and domain, where community members made explicit curation decisions about which sources to include, how to weight languages, and how to handle licensing conflicts. These decisions are documented in published working group reports and dataset cards, creating an auditable record of how the dataset was constructed. The governance model enables reproducibility and allows researchers to understand the human decisions that shaped the training data.

Solves for

Understand the human curation decisions that shaped ROOTS and how they may affect model behavior

Replicate or extend ROOTS by following the documented curation methodology

Contribute to future versions of ROOTS by participating in community curation processes

Audit dataset composition for potential biases introduced by curation decisions

Best for

Researchers studying dataset bias and the impact of curation on model behavior

Teams building open-source datasets and seeking governance models to emulate

Organizations requiring transparent, auditable data sourcing processes

Requires

Access to BigScience working group reports and dataset cards

Ability to read and interpret governance documentation

Optional: participation in BigScience community forums or working groups

Limitations

Governance documentation is descriptive and not machine-queryable — requires manual review to understand curation decisions

Community-driven curation is slower and more complex than centralized decision-making — may introduce inconsistencies across language groups

Documentation may be incomplete or outdated if working groups are no longer active

What makes it unique

ROOTS implements governance as a first-class artifact through published BigScience working group reports that document curation decisions, source selection rationale, and community input — treating data governance as a transparent, reproducible process rather than a black box.

vs alternatives

Unlike proprietary datasets with opaque curation, ROOTS publishes explicit governance documentation enabling researchers to audit curation decisions and understand how they may affect model behavior — a transparency model that supports reproducible research and community accountability.

community-curated data quality annotations and bias documentation

Medium confidence

ROOTS includes community-contributed annotations documenting known biases, quality issues, and limitations in specific sources, stored as structured metadata. These annotations are curated by BigScience and the research community, providing qualitative assessments of data quality and potential harms that complement quantitative metrics, enabling informed decisions about source inclusion.

Solves for

Understand known biases and limitations in specific data sources before including them in training

Make informed decisions about excluding sources with documented quality or ethical concerns

Document known limitations of your model's training data in model cards

Contribute quality annotations for sources you've analyzed

Best for

Teams building models with explicit bias and fairness considerations

Researchers studying bias in pretraining corpora

Organizations with ethical AI governance requirements

Requires

Access to ROOTS documentation and metadata

Understanding of bias types and fairness concepts

Critical reading skills for interpreting qualitative annotations

Limitations

Bias annotations are qualitative and subjective; no standardized bias metrics provided

Coverage is incomplete; not all sources have detailed bias documentation

Annotations reflect BigScience's perspective and may not capture all relevant concerns

What makes it unique

ROOTS incorporates community-curated bias and quality annotations as dataset metadata, treating data governance as an ongoing collaborative process rather than a one-time curation effort. This enables researchers to make informed decisions about data inclusion based on documented concerns.

vs alternatives

ROOTS is more transparent about known biases than datasets with minimal documentation, and it supports bias-aware training where other datasets treat data as neutral. It is comparable to other BigScience datasets but draws on more extensive community input.

multilingual dataset for model training

Medium confidence

ROOTS is a curated multilingual dataset designed for training language models, covering 46 natural languages and 13 programming languages with a focus on data governance and community curation.

Solves for

best multilingual dataset for training

dataset for training language models

free datasets for NLP

curated datasets for AI training

Best for

NLP model training

multilingual applications

What makes it unique

ROOTS covers 46 natural languages and 13 programming languages and pairs that coverage with a strong emphasis on data governance.

vs alternatives

Compared with other multilingual training datasets, ROOTS combines broad language support with community-driven curation.

Capabilities are decomposed by AI analysis. Each maps to specific user intents and improves with match feedback.

Related Artifacts sharing capabilities

Artifacts that share capabilities with ROOTS, ranked by overlap. Discovered automatically through the match graph.

Dataset · 58

mC4

Multilingual web corpus covering 101 languages.

multilingual-text-corpus-extraction-from-web-crawl

language-specific-corpus-filtering-and-subset-selection

multilingual-language-identification-and-segmentation

3 shared capabilities

Dataset · 59

OPUS

Massive parallel corpus for machine translation.

multilingual parallel corpus discovery via searchable index

domain-specific parallel corpus selection and filtering

bulk parallel corpus download with source-specific formatting

3 shared capabilities

Dataset · 61

RedPajama v2

30 trillion token web dataset with 40+ quality signals per document.

multilingual web corpus with consistent annotation across 5 languages

multi-language web-scale document collection with 40+ quality annotations

2 shared capabilities

Dataset · 57

StarCoder Data

783 GB curated code dataset in 86 programming languages with PII redaction.

multi-language code corpus assembly with permissive licensing verification

multi-language code representation with language-specific tokenization

2 shared capabilities

Dataset · 25

c4

Dataset by allenai. 761,810 downloads.

language detection and multilingual corpus stratification

multilingual web-scale text corpus ingestion and deduplication

2 shared capabilities

Dataset · 60

The Pile

EleutherAI's 825 GiB diverse training dataset from 22 sources.

multi-domain pretraining corpus assembly

1 shared capability

Best For

✓ ML researchers training multilingual foundation models
✓ Teams reproducing BLOOM or building variants with similar language coverage
✓ Organizations requiring transparent, documented training data for compliance
✓ Teams with limited storage or compute budgets targeting specific languages
✓ Researchers studying language-specific model behavior or bias
✓ Production teams fine-tuning models for specific language pairs
✓ Compliance and legal teams validating training data for regulatory requirements
✓ Researchers studying dataset bias and source effects on model behavior

Known Limitations

⚠ Dataset is fixed and immutable, with no ability to add new languages or reweight existing ones post-publication
⚠ Streaming from Hugging Face requires internet connectivity; full download is ~1.6TB uncompressed
⚠ Language representation is not perfectly balanced: some languages have significantly more data than others due to source availability
⚠ No built-in deduplication or quality filtering at the record level; relies on upstream source curation
⚠ Subset selection is static at load time: languages cannot be dynamically reweighted during training without reloading
⚠ No built-in cross-lingual deduplication: the same content may appear in multiple language subsets if sourced from multilingual documents

Requirements

Hugging Face Datasets library (datasets>=2.0.0)
Python 3.7+
~1.6TB disk space for full local copy or streaming internet connection
Hugging Face account for authenticated access to some restricted subsets
Hugging Face Datasets library with split/subset support
Knowledge of language codes used in ROOTS (ISO 639-1/3 or custom codes)
Access to ROOTS dataset card on Hugging Face Hub
BigScience documentation files (available in the repository)

Input / Output

Accepts: language code (ISO 639-1 or 639-3) or subset name (e.g., 'en', 'fr', 'zh', 'code_python', 'code_java'), split name (train/validation), optional filtering predicates, optional source name filter, batch size and number of workers, language or domain name, optional query about curation decisions, source name or identifier, optional bias category filter

Produces: raw text documents, raw source code text, structured records with metadata (source, license, language), filtered dataset object, streaming iterables (IterableDataset or DataLoader) for distributed training, batched tensors for model training, metadata about the current batch (language, source), metadata about the selected subset (size, document count), structured metadata (source name, license, date range, document count), attribution text for citations, licensing matrix (CSV or JSON), working group reports (PDF/Markdown), dataset cards with curation rationale, governance matrices and decision logs, bias annotations (text), quality assessments, recommendations for source inclusion/exclusion

UnfragileRank

Adoption: 70% (30% weight)

Quality: 85% (25% weight)

Ecosystem: 30% (10% weight)

Match Graph: 25% (30% weight)

Freshness: 90% (5% weight)

UnfragileRank is computed from adoption signals, documentation quality, ecosystem connectivity, match graph feedback, and freshness. No artifact can pay for a higher rank.
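The listed signals and weights can be combined as a plain weighted average to see how they relate to the overall score. The formula is an assumption, since the page does not state how the signals are aggregated; the function name is hypothetical.

```python
# (signal score, weight) pairs from the breakdown above, expressed as fractions.
components = {
    "adoption": (0.70, 0.30),
    "quality": (0.85, 0.25),
    "ecosystem": (0.30, 0.10),
    "match_graph": (0.25, 0.30),
    "freshness": (0.90, 0.05),
}


def weighted_rank(parts: dict) -> float:
    """Weighted average of the signals on a 0-100 scale (assumed formula)."""
    total_weight = sum(weight for _, weight in parts.values())
    weighted_sum = sum(score * weight for score, weight in parts.values())
    return 100 * weighted_sum / total_weight


rank = weighted_rank(components)
```

Under this assumption the result is 21 + 21.25 + 3 + 7.5 + 4.5 = 57.25, which is consistent with the 57/100 score shown for ROOTS.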

About

BigScience's curated multilingual dataset used to train BLOOM, covering 46 natural languages and 13 programming languages with explicit data governance, sourcing documentation, and community-driven curation.

Alternatives to ROOTS

Hugging Face MCP Server · 62 · MCP Server

Official Hugging Face MCP: search models/datasets/Spaces/papers and call Spaces as tools.

Langfuse · 57 · Repository

Open-source LLM observability: tracing, prompt management, evaluation, cost tracking, self-hosted.

The Stack v2 · 59 · Dataset

67 TB permissively licensed code dataset across 600+ languages.

The Pile · 60 · Dataset

EleutherAI's 825 GiB diverse training dataset from 22 sources.
