{"passport":{"unfragile":{"@version":"1.0","version":"2026-05","artifact":{"id":"roots","slug":"roots","name":"ROOTS","type":"dataset","url":"https://huggingface.co/datasets/bigscience-data/roots","page_url":"https://unfragile.ai/roots","categories":["model-training"],"tags":[],"pricing":{"model":"free","free":true,"starting_price":null},"status":"active","verified":false},"capabilities":[{"id":"roots__cap_0","uri":"capability://data.processing.analysis.multilingual.pretraining.corpus.assembly.with.explicit.language.coverage","name":"multilingual pretraining corpus assembly with explicit language coverage","description":"ROOTS provides a curated collection of 46 natural languages and 13 programming languages organized into discrete, versioned subsets with documented sourcing and licensing metadata. The dataset uses a modular architecture where each language community contributed curation decisions, enabling downstream models like BLOOM to train on balanced multilingual representations without requiring custom data collection pipelines. Data is indexed by language code and accessible via Hugging Face Datasets API with streaming support for large-scale distributed training.","intents":["Train a multilingual language model without building a custom data pipeline from scratch","Understand the exact composition and provenance of training data used in BLOOM","Replicate or extend multilingual pretraining with transparent data governance","Access balanced language representation across 46+ languages for fair model evaluation"],"best_for":["ML researchers training multilingual foundation models","Teams reproducing BLOOM or building variants with similar language coverage","Organizations requiring transparent, documented training data for compliance"],"limitations":["Dataset is fixed and immutable — no ability to add new languages or reweight existing ones post-publication","Streaming from Hugging Face requires internet connectivity; full download is ~1.6TB uncompressed","Language representation is not perfectly balanced — some languages have significantly more data than others due to source availability","No built-in deduplication or quality filtering at the record level — relies on upstream source curation"],"requires":["Hugging Face Datasets library (datasets>=2.0.0)","Python 3.7+","~1.6TB disk space for full local copy or streaming internet connection","Hugging Face account for authenticated access to some restricted subsets"],"input_types":["language code (ISO 639-1 or 639-3)","subset name (e.g., 'en', 'fr', 'code_python')","split identifier (train/validation)"],"output_types":["raw text documents","structured records with metadata (source, license, language)","streaming iterables for distributed training"],"categories":["data-processing-analysis","model-training"],"confidence":0.5,"matches":0,"success_rate":0},{"id":"roots__cap_1","uri":"capability://data.processing.analysis.language.specific.subset.filtering.and.selective.loading","name":"language-specific subset filtering and selective loading","description":"ROOTS enables fine-grained selection of training data by language, programming language, or source category through the Hugging Face Datasets API's filtering and split mechanisms. Users can load only subsets relevant to their task (e.g., only English + French, or only code data) without downloading the full corpus, reducing storage and compute overhead. The dataset structure uses language codes as primary keys, allowing efficient subset materialization during training pipeline initialization.","intents":["Train a bilingual model on only English and French without downloading 40+ other languages","Build a code-focused model using only the 13 programming language subsets","Evaluate model performance on specific language groups (e.g., low-resource languages) in isolation","Reduce training data size and iteration time by selecting relevant language subsets"],"best_for":["Teams with limited storage or compute budgets targeting specific languages","Researchers studying language-specific model behavior or bias","Production teams fine-tuning models for specific language pairs"],"limitations":["Subset selection is static at load time — cannot dynamically reweight languages during training without reloading","No built-in cross-lingual deduplication — same content may appear in multiple language subsets if sourced from multilingual documents","Filtering is coarse-grained (by language code) — no finer filtering by domain, quality score, or document length within a language"],"requires":["Hugging Face Datasets library with split/subset support","Knowledge of language codes used in ROOTS (ISO 639-1/3 or custom codes)","Python 3.7+"],"input_types":["language code string (e.g., 'en', 'zh', 'code_python')","split name (e.g., 'train')","optional filtering predicates"],"output_types":["filtered dataset object","streaming iterator over selected language subset","metadata about selected subset (size, document count)"],"categories":["data-processing-analysis","automation-workflow"],"confidence":0.5,"matches":0,"success_rate":0},{"id":"roots__cap_2","uri":"capability://data.processing.analysis.source.provenance.and.licensing.metadata.retrieval","name":"source provenance and licensing metadata retrieval","description":"ROOTS includes structured metadata for each data subset documenting original source (e.g., Wikipedia, GitHub, web crawls), license type (CC-BY, MIT, public domain), and curation decisions made by BigScience working groups. This metadata is accessible via dataset cards and supplementary documentation files, enabling users to audit data lineage, verify legal compliance, and understand potential biases introduced by source selection. The metadata structure maps each language subset to its upstream sources with explicit attribution.","intents":["Verify that training data complies with organizational or regulatory data sourcing policies","Understand which sources contributed to a specific language subset for bias analysis","Attribute data sources in model documentation and research papers","Identify and exclude data from specific sources if needed (e.g., due to license conflicts)"],"best_for":["Compliance and legal teams validating training data for regulatory requirements","Researchers studying dataset bias and source effects on model behavior","Organizations publishing models and needing transparent attribution","Teams building models for sensitive domains (healthcare, finance) requiring audit trails"],"limitations":["Metadata is descriptive but not machine-queryable at scale — requires manual inspection of documentation files","Source attribution is at the subset level, not per-document — cannot trace individual records to original sources","License information is provided but enforcement is the user's responsibility — ROOTS does not prevent use of data in violation of stated licenses","Metadata may be incomplete for some sources, especially web crawls where original URLs are not preserved"],"requires":["Access to ROOTS dataset card on Hugging Face Hub","BigScience documentation files (available in the repository)","Manual review capability for legal/compliance teams"],"input_types":["language code or subset name","optional source name filter"],"output_types":["structured metadata (source name, license, date range, document count)","attribution text for citations","licensing matrix (CSV or JSON)"],"categories":["data-processing-analysis","safety-moderation"],"confidence":0.5,"matches":0,"success_rate":0},{"id":"roots__cap_3","uri":"capability://automation.workflow.distributed.streaming.access.for.large.scale.training.pipelines","name":"distributed streaming access for large-scale training pipelines","description":"ROOTS integrates with Hugging Face Datasets' streaming API, enabling distributed training systems to fetch data on-the-fly without materializing the full corpus locally. The dataset is partitioned by language, allowing multiple training nodes to load different language subsets in parallel via HTTP range requests. This architecture supports efficient distributed training on clusters with limited aggregate storage, as each node streams only its assigned language subset during training iterations.","intents":["Train on ROOTS using distributed training frameworks (PyTorch DDP, DeepSpeed) without requiring shared storage","Reduce per-node storage requirements by streaming data on-demand during training","Enable rapid iteration on model architecture without waiting for full dataset downloads","Scale training across multiple nodes with independent data loading pipelines"],"best_for":["Teams with distributed training infrastructure (multi-GPU clusters, cloud training)","Organizations with limited per-node storage but high network bandwidth","Research groups iterating rapidly on model architectures and needing fast data access"],"limitations":["Streaming introduces network latency (~10-50ms per batch fetch) compared to local disk I/O","Requires stable, high-bandwidth internet connection — not suitable for offline training or unreliable networks","Streaming performance degrades if multiple nodes fetch the same subset simultaneously — no built-in caching or CDN","Epoch-based training with streaming requires re-fetching data each epoch, increasing total network traffic"],"requires":["Hugging Face Datasets library with streaming support (datasets>=2.0.0)","Python 3.7+","Network connectivity to Hugging Face Hub (or self-hosted mirror)","PyTorch or TensorFlow with distributed training support (optional but recommended)"],"input_types":["language subset identifier","split name (train/validation)","batch size and number of workers"],"output_types":["streaming DataLoader or IterableDataset","batched tensors for model training","metadata about current batch (language, source)"],"categories":["automation-workflow","data-processing-analysis"],"confidence":0.5,"matches":0,"success_rate":0},{"id":"roots__cap_4","uri":"capability://data.processing.analysis.programming.language.code.corpus.with.language.specific.organization","name":"programming language code corpus with language-specific organization","description":"ROOTS includes 13 programming language subsets (Python, Java, C++, JavaScript, etc.) organized as separate, versioned datasets within the larger corpus. Each programming language subset is curated from sources like GitHub and Stack Overflow, with language-specific metadata (e.g., license type, repository stars). The code data is structured as raw source files with minimal preprocessing, enabling downstream models to learn language-specific syntax and idioms without artificial normalization.","intents":["Train code generation or code understanding models on diverse programming languages","Build language-specific models (e.g., Python-only) by selecting relevant code subsets","Evaluate model performance on code tasks across multiple languages","Understand the composition and quality of code data used in BLOOM's training"],"best_for":["ML researchers building code-focused language models or code completion tools","Teams training models for specific programming languages","Organizations studying code model bias and performance across languages"],"limitations":["Code data is raw source without semantic parsing — no AST-level structure or code quality filtering","License information for code is less granular than natural language subsets — some code may have unclear licensing","No deduplication of code snippets — identical functions may appear multiple times across repositories","Code data may include low-quality or malicious code (e.g., from Stack Overflow answers) without filtering"],"requires":["Hugging Face Datasets library","Python 3.7+","Knowledge of programming language codes used in ROOTS"],"input_types":["programming language code (e.g., 'code_python', 'code_java')","split name (train/validation)"],"output_types":["raw source code text","structured records with metadata (language, source, license)","streaming iterables for training"],"categories":["data-processing-analysis","code-generation-editing"],"confidence":0.5,"matches":0,"success_rate":0},{"id":"roots__cap_5","uri":"capability://data.processing.analysis.community.driven.data.curation.and.governance.documentation","name":"community-driven data curation and governance documentation","description":"ROOTS was assembled through BigScience working groups organized by language and domain, where community members made explicit curation decisions about which sources to include, how to weight languages, and how to handle licensing conflicts. These decisions are documented in published working group reports and dataset cards, creating an auditable record of how the dataset was constructed. The governance model enables reproducibility and allows researchers to understand the human decisions that shaped the training data.","intents":["Understand the human curation decisions that shaped ROOTS and how they may affect model behavior","Replicate or extend ROOTS by following the documented curation methodology","Contribute to future versions of ROOTS by participating in community curation processes","Audit dataset composition for potential biases introduced by curation decisions"],"best_for":["Researchers studying dataset bias and the impact of curation on model behavior","Teams building open-source datasets and seeking governance models to emulate","Organizations requiring transparent, auditable data sourcing processes","Communities interested in contributing to multilingual dataset development"],"limitations":["Governance documentation is descriptive and not machine-queryable — requires manual review to understand curation decisions","Community-driven curation is slower and more complex than centralized decision-making — may introduce inconsistencies across language groups","Documentation may be incomplete or outdated if working groups are no longer active","Curation decisions reflect the values and biases of participating communities — not universally neutral"],"requires":["Access to BigScience working group reports and dataset cards","Ability to read and interpret governance documentation","Optional: participation in BigScience community forums or working groups"],"input_types":["language or domain name","optional query about curation decisions"],"output_types":["working group reports (PDF/Markdown)","dataset cards with curation rationale","governance matrices and decision logs"],"categories":["data-processing-analysis","safety-moderation"],"confidence":0.5,"matches":0,"success_rate":0},{"id":"roots__cap_6","uri":"capability://safety.moderation.community.curated.data.quality.annotations.and.bias.documentation","name":"community-curated data quality annotations and bias documentation","description":"ROOTS includes community-contributed annotations documenting known biases, quality issues, and limitations in specific sources, stored as structured metadata. These annotations are curated by BigScience and the research community, providing qualitative assessments of data quality and potential harms that complement quantitative metrics, enabling informed decisions about source inclusion.","intents":["Understand known biases and limitations in specific data sources before including them in training","Make informed decisions about excluding sources with documented quality or ethical concerns","Document known limitations of your model's training data in model cards","Contribute quality annotations for sources you've analyzed"],"best_for":["Teams building models with explicit bias and fairness considerations","Researchers studying bias in pretraining corpora","Organizations with ethical AI governance requirements"],"limitations":["Bias annotations are qualitative and subjective; no standardized bias metrics provided","Coverage is incomplete; not all sources have detailed bias documentation","Annotations reflect BigScience's perspective and may not capture all relevant concerns","No mechanism for versioning or updating annotations as understanding of biases evolves"],"requires":["Access to ROOTS documentation and metadata","Understanding of bias types and fairness concepts","Critical reading skills for interpreting qualitative annotations"],"input_types":["source name or identifier","optional bias category filter"],"output_types":["bias annotations (text)","quality assessments","recommendations for source inclusion/exclusion"],"categories":["safety-moderation","memory-knowledge"],"confidence":0.5,"matches":0,"success_rate":0},{"id":"roots__headline","uri":"capability://model.training.multilingual.dataset.for.model.training","name":"multilingual dataset for model training","description":"ROOTS is a curated multilingual dataset designed for training language models, covering 46 natural languages and 13 programming languages with a focus on data governance and community curation.","intents":["best multilingual dataset for training","dataset for training language models","free datasets for NLP","curated datasets for AI training","multilingual datasets for machine learning"],"best_for":["NLP model training","multilingual applications"],"limitations":[],"requires":[],"input_types":[],"output_types":[],"categories":["model-training"],"confidence":0.5,"matches":0,"success_rate":0}],"trust":{"score":57,"verified":false,"data_access_risk":"high","permissions":["Hugging Face Datasets library (datasets>=2.0.0)","Python 3.7+","~1.6TB disk space for full local copy or streaming internet connection","Hugging Face account for authenticated access to some restricted subsets","Hugging Face Datasets library with split/subset support","Knowledge of language codes used in ROOTS (ISO 639-1/3 or custom codes)","Access to ROOTS dataset card on Hugging Face Hub","BigScience documentation files (available in the repository)","Manual review capability for legal/compliance teams","Hugging Face Datasets library with streaming support (datasets>=2.0.0)"],"failure_modes":["Dataset is fixed and immutable — no ability to add new languages or reweight existing ones post-publication","Streaming from Hugging Face requires internet connectivity; full download is ~1.6TB uncompressed","Language representation is not perfectly balanced — some languages have significantly more data than others due to source availability","No built-in deduplication or quality filtering at the record level — relies on upstream source curation","Subset selection is static at load time — cannot dynamically reweight languages during training without reloading","No built-in cross-lingual deduplication — same content may appear in multiple language subsets if sourced from multilingual documents","Filtering is coarse-grained (by language code) — no finer filtering by domain, quality score, or document length within a language","Metadata is descriptive but not machine-queryable at scale — requires manual inspection of documentation files","Source attribution is at the subset level, not per-document — cannot trace individual records to original sources","License information is provided but enforcement is the user's responsibility — ROOTS does not prevent use of data in violation of stated licenses","builder identity is not verified yet","no observed match outcomes yet"],"rank_breakdown":{"adoption":0.7,"quality":0.8500000000000001,"ecosystem":0.3,"match_graph":0.25,"freshness":0.9,"weights":{"adoption":0.3,"quality":0.25,"ecosystem":0.1,"match_graph":0.3,"freshness":0.05}},"observed_outcomes":{"matches":0,"success_rate":0,"avg_confidence":0,"top_intents":[],"last_matched_at":null},"maintenance":{"status":"active","updated_at":"2026-05-24T12:16:25.061Z","last_scraped_at":null,"last_commit":null},"community":{"stars":null,"forks":null,"weekly_downloads":null,"model_downloads":null,"model_likes":null}},"distribution":{"claim_url":"https://unfragile.ai/submit?claim=roots","compare_url":"https://unfragile.ai/compare?artifact=roots"}},"signature":"2OO11GS+69r0hoB5/3jAcyV/R1GIRvtFrC/BdoORBARneMSnBMSDGVqjEN8N6ErYDRoGM4UmayzPPApS0XKcDQ==","signedAt":"2026-06-15T08:17:03.368Z","signedBy":"unfragile.ai","version":1},"_links":{"self":"https://unfragile.ai/api/v1/passport/roots","artifact":"https://unfragile.ai/roots","verify":"https://unfragile.ai/api/v1/verify?slug=roots","publicKey":"https://unfragile.ai/api/v1/trust-passport-public-key","spec":"https://unfragile.ai/trust","schema":"https://unfragile.ai/schema.json","docs":"https://unfragile.ai/docs"}}