{"passport":{"unfragile":{"@version":"1.0","version":"2026-05","artifact":{"id":"dolma","slug":"dolma","name":"Dolma","type":"dataset","url":"https://allenai.org/dolma","page_url":"https://unfragile.ai/dolma","categories":["model-training"],"tags":[],"pricing":{"model":"free","free":true,"starting_price":null},"status":"active","verified":false},"capabilities":[{"id":"dolma__cap_0","uri":"capability://data.processing.analysis.multi.source.pretraining.data.composition.with.documented.curation.rules","name":"multi-source pretraining data composition with documented curation rules","description":"Dolma aggregates 3 trillion tokens from 7 heterogeneous sources (Common Crawl, The Stack, peS2o, Project Gutenberg, Wikipedia, Wikibooks, C4) with fully documented filtering criteria, deduplication methods, and mixing ratios. The composition system enables researchers to understand exactly which data proportions and quality thresholds were applied, making training runs reproducible across different teams and hardware configurations. Data is segmented into pretraining, mid-training, and post-training pools to support staged model development.","intents":["I need to train a language model with transparent, reproducible data sourcing that I can audit and replicate","I want to understand the exact composition of training data used in state-of-the-art open models","I need a balanced mix of web text, code, academic papers, and literary content without building my own pipeline","I want to compare model performance across different data mixture ratios while keeping other variables constant"],"best_for":["LLM researchers conducting reproducible pretraining experiments","Teams building custom language models with transparency requirements","Open-source model development communities needing auditable training data","Academic institutions requiring documented data provenance for publications"],"limitations":["Dataset is a static snapshot with no versioning or update mechanism described — cannot incorporate new data sources or refresh stale web content","Fixed to 7 predefined sources with no documented mechanism for adding custom data sources or adjusting mixing ratios dynamically","Requires external training infrastructure (OlmoCore) and post-training pipeline (Open Instruct) — Dolma alone is not a complete training solution","No quantitative quality metrics or benchmark comparisons provided in documentation — quality assessment is implicit in source selection rather than explicit","Storage and bandwidth requirements unknown — no guidance on disk space, download time, or network costs for accessing full dataset","Licensing terms and commercial usage restrictions not specified in available documentation"],"requires":["Access to allenai.org/dolma download infrastructure (protocol and authentication method unspecified)","OlmoCore training framework (separate artifact) to consume and apply dataset","Sufficient storage capacity for 3 trillion tokens (exact GB/TB requirement not documented)","Understanding of data curation concepts: deduplication, filtering, mixing ratios, and training phases","Familiarity with large-scale model training pipelines and distributed training infrastructure"],"input_types":["raw web crawl data (Common Crawl)","source code repositories (The Stack)","academic paper metadata and text (peS2o)","book text (Project Gutenberg)","wiki markup and structured text (Wikipedia, Wikibooks)","filtered web text (C4)"],"output_types":["tokenized pretraining dataset","mid-training refinement dataset","post-training instruction dataset","data mixture specifications (ratios and filtering rules)","deduplication and filtering rule documentation"],"categories":["data-processing-analysis","model-training"],"confidence":0.5,"matches":0,"success_rate":0},{"id":"dolma__cap_1","uri":"capability://data.processing.analysis.source.specific.data.filtering.and.quality.control","name":"source-specific data filtering and quality control","description":"Dolma implements source-specific filtering pipelines using documented rules applied through tools like Datamap-rs (large-scale data cleaning) and Duplodocus (fuzzy deduplication). Each of the 7 sources undergoes tailored quality filtering appropriate to its characteristics: web crawl data is filtered for language and content quality, code is filtered for license and syntax validity, academic papers are filtered by venue quality, and literary text is filtered for encoding and completeness. Filtering rules are explicitly documented to enable researchers to understand and potentially modify quality thresholds.","intents":["I need to understand what quality filters were applied to each data source so I can assess potential biases or gaps","I want to remove low-quality, duplicate, or malicious content from my training data without building custom filtering infrastructure","I need to apply consistent quality standards across heterogeneous data sources (web, code, academic, books) with different characteristics","I want to reproduce the exact filtering decisions used in OLMo training to validate model behavior"],"best_for":["Data engineers building custom training datasets who need reference implementations for quality filtering","Researchers studying the impact of data quality on model performance","Teams concerned about training on low-quality, toxic, or license-violating content","Reproducibility-focused projects requiring auditable data cleaning pipelines"],"limitations":["Filtering rules are documented but not parameterized — no mechanism to adjust thresholds or disable specific filters without rebuilding the dataset","Deduplication method (Duplodocus) is a separate tool with unknown computational cost and memory requirements","No quantitative metrics on filtering impact (e.g., percentage of data removed per source, quality score distributions)","Filtering rules are source-specific and may not generalize to custom data sources added by users","Unknown whether filtering is applied once at dataset creation or continuously during training"],"requires":["Datamap-rs tool for large-scale data cleaning (separate artifact, requirements unknown)","Duplodocus tool for fuzzy deduplication (separate artifact, requirements unknown)","Understanding of source-specific quality criteria (language detection, code syntax validation, venue ranking, etc.)","Computational resources to run filtering pipelines on multi-trillion-token dataset"],"input_types":["raw web crawl documents","source code files with metadata","academic paper text and metadata","book text with encoding information","wiki markup and revision history"],"output_types":["filtered document corpus","deduplication mapping (original → canonical document)","quality score distributions per source","filtering rule specifications and thresholds"],"categories":["data-processing-analysis","safety-moderation"],"confidence":0.5,"matches":0,"success_rate":0},{"id":"dolma__cap_10","uri":"capability://automation.workflow.post.training.data.pipeline.integration.with.open.instruct.for.instruction.tuning","name":"post-training data pipeline integration with open instruct for instruction tuning","description":"Dolma's post-training data pool is designed for use with Open Instruct, Allen AI's instruction tuning framework, enabling seamless transition from pretraining to instruction tuning. The post-training pool contains instruction-formatted data (format unspecified) optimized for alignment and capability refinement. Integration with Open Instruct provides data loading, instruction formatting, and training orchestration for the post-training phase. This integration enables researchers to implement the full training pipeline (pretraining → continued pretraining → instruction tuning) using coordinated Dolma and Open Instruct components.","intents":["I want to apply instruction tuning to a model trained on Dolma pretraining data using a proven, open-source framework","I need post-training data that is optimized for alignment and instruction following without manual curation","I want to implement the full training pipeline from pretraining through instruction tuning using coordinated tools","I need to reproduce OLMo model training including the post-training phase"],"best_for":["Teams implementing full training pipelines (pretraining → instruction tuning) using Dolma and Open Instruct","Researchers studying the impact of instruction tuning on model capabilities and alignment","Open-source projects requiring integrated pretraining and post-training infrastructure","Developers building instruction-tuned models on top of Dolma pretraining"],"limitations":["Post-training data pool composition is not documented — unclear what sources or formats are included","Integration is specific to Open Instruct — using Dolma post-training data with other instruction tuning frameworks requires custom adaptation","Open Instruct requirements and capabilities are unknown from available materials","No documentation of instruction data quality, diversity, or coverage","Unknown whether post-training pool is disjoint from pretraining pool or overlapping","Requires learning Open Instruct API and configuration — not compatible with existing instruction tuning pipelines","Unknown whether Open Instruct is actively maintained and supported"],"requires":["Open Instruct framework (separate artifact, requirements unknown)","Model checkpoint from pretraining phase (trained on Dolma pretraining pool)","Understanding of instruction tuning concepts and alignment","Computational resources for instruction tuning (GPUs/TPUs)","Familiarity with Open Instruct API and configuration"],"input_types":["Dolma post-training data pool (instruction-formatted, format unspecified)","pretrained model checkpoint","instruction tuning configuration (learning rate, batch size, number of steps, etc.)"],"output_types":["instruction-tuned model checkpoint","training logs and metrics","evaluation results (instruction following, alignment metrics)","model artifacts (weights, config, tokenizer)"],"categories":["automation-workflow","tool-use-integration"],"confidence":0.5,"matches":0,"success_rate":0},{"id":"dolma__cap_2","uri":"capability://data.processing.analysis.staged.training.data.segmentation.for.pretraining.mid.training.and.post.training.phases","name":"staged training data segmentation for pretraining, mid-training, and post-training phases","description":"Dolma provides three distinct data pools optimized for different training stages: a pretraining pool for initial model training on diverse, general-purpose text; a mid-training pool for continued pretraining with potentially different source ratios or quality thresholds; and a post-training pool for instruction tuning and alignment. This segmentation enables researchers to apply different data compositions at different training phases without managing separate datasets, and allows for staged training strategies where model behavior is refined through targeted data exposure.","intents":["I want to train a model with different data compositions at different training stages (e.g., general pretraining followed by code-focused continued training)","I need to apply instruction tuning data that is separate from and optimized differently than pretraining data","I want to study how data composition at different training phases affects model capabilities and alignment","I need a dataset that supports the full training pipeline from pretraining through post-training without managing multiple separate datasets"],"best_for":["Teams implementing multi-stage training strategies (pretraining → continued pretraining → instruction tuning)","Researchers studying the impact of training phase on model behavior and capabilities","Open-source model developers using the OlmoCore training framework","Projects requiring reproducible, staged training pipelines with documented data composition at each stage"],"limitations":["Segmentation into three pools is fixed — no mechanism to create custom training phases or adjust phase boundaries","Composition of each pool (source ratios, filtering rules) is not independently documented — unclear how mid-training and post-training pools differ from pretraining","No guidance on optimal training duration or data quantity for each phase","Requires external training framework (OlmoCore) to implement staged training — Dolma provides data but not orchestration","Unknown whether pools are disjoint (no data overlap) or overlapping (some data used in multiple phases)"],"requires":["OlmoCore training framework to implement staged training","Understanding of multi-stage training strategies and when to transition between phases","Separate post-training pipeline (Open Instruct) for instruction tuning phase","Computational resources to train through all three phases"],"input_types":["pretraining pool: diverse text from all 7 sources","mid-training pool: potentially source-weighted or quality-filtered subset","post-training pool: instruction-formatted data (format unspecified)"],"output_types":["pretraining checkpoint (after initial training phase)","mid-training checkpoint (after continued pretraining)","post-training checkpoint (after instruction tuning)","training phase specifications and data composition per phase"],"categories":["data-processing-analysis","automation-workflow"],"confidence":0.5,"matches":0,"success_rate":0},{"id":"dolma__cap_3","uri":"capability://memory.knowledge.data.provenance.tracing.from.trained.models.back.to.source.documents","name":"data provenance tracing from trained models back to source documents","description":"Dolma integrates with the OlmoTrace tool, which enables researchers to trace model outputs and behaviors back to the specific source documents in the training dataset that contributed to those outputs. This capability works by maintaining mappings between training data and model internals, allowing queries like 'which documents influenced this model's response?' or 'what is the source distribution of training data for this capability?'. Traceability is implemented through document-level tracking during preprocessing and training, enabling post-hoc analysis of model behavior in terms of training data composition.","intents":["I want to understand which training documents influenced a specific model output or behavior","I need to audit model training data to identify potential sources of bias, toxicity, or copyright violations","I want to study how different source documents (web vs. code vs. academic) contribute to different model capabilities","I need to trace model failures or hallucinations back to their training data origins for debugging"],"best_for":["AI safety and alignment researchers studying model behavior in terms of training data","Teams auditing models for bias, toxicity, or copyright concerns","Researchers studying the relationship between training data and model capabilities","Open-source model developers requiring transparency and accountability in training"],"limitations":["OlmoTrace is a separate tool with unknown computational cost, latency, and scalability","Tracing mechanism is not described in detail — unclear whether it traces individual tokens, documents, or source domains","No quantitative metrics on tracing accuracy or completeness","Tracing is post-hoc (after training) — cannot be used to guide training in real-time","Unknown whether tracing works for all model sizes (OLMo 7B, 32B) or only specific variants","Requires trained model and access to OlmoTrace tool — not applicable to arbitrary models trained on Dolma"],"requires":["OlmoTrace tool (separate artifact, requirements unknown)","Trained model (e.g., OLMo 7B or 32B) with provenance metadata","Understanding of model internals and how to interpret tracing results","Access to original Dolma source documents for comparison"],"input_types":["model output or behavior (text, logits, attention patterns)","trained model checkpoint with provenance metadata"],"output_types":["source document identifiers and text","source distribution (percentage of output influenced by each source)","document-level contribution scores","source domain analysis (web vs. code vs. academic contribution)"],"categories":["memory-knowledge","safety-moderation"],"confidence":0.5,"matches":0,"success_rate":0},{"id":"dolma__cap_4","uri":"capability://data.processing.analysis.code.specific.data.extraction.and.quality.filtering.from.the.stack","name":"code-specific data extraction and quality filtering from the stack","description":"Dolma incorporates The Stack, a large-scale source code dataset, with code-specific filtering and quality control. Code data is filtered for license compliance (removing GPL and other restrictive licenses), syntax validity, and repository quality. The Stack integration provides access to diverse programming languages and coding patterns without requiring separate code dataset curation. Code is deduplicated using the same Duplodocus fuzzy deduplication as other sources, enabling detection of near-duplicate code across repositories.","intents":["I want to train a model with code data that respects open-source licenses and avoids GPL-licensed code","I need access to diverse, high-quality source code across multiple programming languages without building my own code dataset","I want to study how code data composition affects model coding capabilities","I need to ensure code training data is syntactically valid and from reputable repositories"],"best_for":["Teams training code-capable language models (e.g., code completion, code generation)","Researchers studying the impact of code data on model capabilities","Open-source projects concerned about license compliance in training data","Developers building models that need to avoid GPL-licensed code for commercial use"],"limitations":["The Stack is a fixed snapshot — no mechanism to add new repositories or refresh stale code","License filtering removes GPL and other restrictive licenses, reducing code diversity for some use cases","Syntax validation rules are not documented — unclear which languages are supported or how invalid syntax is handled","No metrics on code quality, repository maturity, or language distribution","Code deduplication may remove legitimate code variations or refactorings","Unknown whether code is deduplicated only within The Stack or across all Dolma sources"],"requires":["Understanding of open-source licenses and license compliance requirements","Familiarity with source code structure and programming language syntax","The Stack dataset (integrated into Dolma, but original source is separate artifact)"],"input_types":["source code files from public repositories","repository metadata (language, license, stars, etc.)","code comments and documentation"],"output_types":["filtered, deduplicated code corpus","code by programming language (distribution unknown)","license-compliant code subset","code quality metrics (syntax validity, repository maturity)"],"categories":["data-processing-analysis","code-generation-editing"],"confidence":0.5,"matches":0,"success_rate":0},{"id":"dolma__cap_5","uri":"capability://data.processing.analysis.academic.paper.text.extraction.and.venue.based.quality.filtering.via.pes2o","name":"academic paper text extraction and venue-based quality filtering via pes2o","description":"Dolma incorporates peS2o, a large-scale academic paper dataset, with venue-based quality filtering that prioritizes papers from high-impact conferences and journals. Academic papers are filtered by publication venue quality (e.g., top-tier conferences, high-impact journals) rather than citation count or other metrics, ensuring training data includes rigorous, peer-reviewed research. Paper text is extracted from PDFs and structured metadata, enabling models to learn from scientific writing and domain-specific knowledge. Academic data is deduplicated using the same fuzzy deduplication as other sources.","intents":["I want to train a model with high-quality academic and scientific content without manually curating papers","I need to ensure training data includes peer-reviewed research from reputable venues","I want to study how academic data composition affects model knowledge and reasoning capabilities","I need access to diverse scientific domains and research methodologies in training data"],"best_for":["Teams training models for scientific and technical applications (e.g., scientific Q&A, research summarization)","Researchers studying the impact of academic data on model knowledge and reasoning","Projects requiring high-quality, peer-reviewed content in training data","Developers building models that need to understand scientific concepts and methodologies"],"limitations":["Venue-based filtering may introduce bias toward certain research communities or methodologies (e.g., favoring empirical over theoretical work)","Venue quality rankings are not documented — unclear which conferences/journals are considered 'high-impact'","Paper extraction from PDFs is lossy — figures, tables, and equations may be missing or corrupted","No metrics on paper quality, citation impact, or research novelty","Academic data is a fixed snapshot — no mechanism to add new papers or refresh with recent research","Unknown whether papers are deduplicated across versions (preprints vs. published versions)"],"requires":["peS2o dataset (integrated into Dolma, but original source is separate artifact)","Understanding of academic publishing and venue quality rankings","Familiarity with scientific writing and domain-specific terminology"],"input_types":["academic paper PDFs and metadata","paper text, abstracts, and citations","venue information (conference, journal, year)"],"output_types":["filtered academic paper corpus","papers by research domain (distribution unknown)","venue-filtered subset (high-impact venues only)","paper metadata and citation information"],"categories":["data-processing-analysis","text-generation-language"],"confidence":0.5,"matches":0,"success_rate":0},{"id":"dolma__cap_6","uri":"capability://data.processing.analysis.web.text.filtering.and.deduplication.across.common.crawl.and.c4.sources","name":"web text filtering and deduplication across common crawl and c4 sources","description":"Dolma integrates web text from both Common Crawl (raw web crawl) and C4 (pre-filtered web text), with documented filtering rules for language detection, content quality, and toxicity. Web data undergoes source-specific filtering appropriate to its characteristics: Common Crawl data is filtered more aggressively due to lower baseline quality, while C4 data benefits from existing filtering. All web data is deduplicated using Duplodocus fuzzy deduplication to remove near-duplicate content across domains. The combination of two web sources with different filtering approaches provides diversity while maintaining quality standards.","intents":["I want to train a model with diverse, high-quality web text without building my own web crawl pipeline","I need to understand what filtering was applied to web data to assess potential biases or gaps","I want to study how web data composition affects model knowledge and language capabilities","I need to ensure web training data is deduplicated and free of low-quality or toxic content"],"best_for":["Teams training general-purpose language models with web data","Researchers studying the impact of web data on model capabilities","Projects concerned about training on low-quality or toxic web content","Developers building models that need broad knowledge from diverse web sources"],"limitations":["Web data filtering rules are documented but not parameterized — no mechanism to adjust quality thresholds","Language detection method is not specified — unclear which languages are included or excluded","Toxicity filtering rules are not detailed — unclear what content is considered toxic or how it is detected","No quantitative metrics on filtering impact (e.g., percentage of web data removed, quality score distributions)","Web data is a static snapshot — no mechanism to refresh with new web content or remove outdated information","Deduplication may remove legitimate content variations or paraphrases","Unknown whether Common Crawl and C4 data are deduplicated against each other or separately"],"requires":["Common Crawl and C4 datasets (integrated into Dolma, but original sources are separate artifacts)","Understanding of web content quality and toxicity detection","Datamap-rs tool for large-scale data cleaning (separate artifact, requirements unknown)"],"input_types":["raw web crawl documents (Common Crawl)","pre-filtered web text (C4)","document metadata (language, domain, quality scores)"],"output_types":["filtered, deduplicated web text corpus","web data by language and domain (distribution unknown)","quality-filtered subset (high-quality content only)","deduplication mapping (duplicate → canonical document)"],"categories":["data-processing-analysis","safety-moderation"],"confidence":0.5,"matches":0,"success_rate":0},{"id":"dolma__cap_7","uri":"capability://data.processing.analysis.literary.and.reference.text.integration.from.project.gutenberg.wikipedia.and.wikibooks","name":"literary and reference text integration from project gutenberg, wikipedia, and wikibooks","description":"Dolma incorporates literary and reference text from Project Gutenberg (public domain books), Wikipedia (encyclopedia articles), and Wikibooks (educational textbooks), providing access to structured, high-quality, and diverse written content. Literary data is filtered for completeness and encoding validity, while Wikipedia and Wikibooks data are filtered for article quality and relevance. These sources provide models with exposure to diverse writing styles, narrative structures, and domain-specific knowledge without requiring separate curation. All sources are deduplicated using Duplodocus fuzzy deduplication.","intents":["I want to train a model with diverse literary and reference content to improve language understanding and writing quality","I need access to structured, high-quality encyclopedia and textbook content without manual curation","I want to study how literary data composition affects model language capabilities and writing style","I need to ensure training data includes diverse writing styles and narrative structures"],"best_for":["Teams training general-purpose language models with emphasis on language quality and diversity","Researchers studying the impact of literary data on model language capabilities","Projects requiring high-quality, structured reference content in training data","Developers building models that need to understand diverse writing styles and domains"],"limitations":["Project Gutenberg data is limited to public domain works (pre-1923 in most cases), potentially introducing historical bias","Wikipedia and Wikibooks data may contain outdated information or biases in article coverage","Article quality filtering rules are not documented — unclear which articles are included or excluded","No metrics on literary diversity, writing quality, or domain coverage","Literary data is a static snapshot — no mechanism to add new works or refresh with recent publications","Deduplication may remove legitimate content variations or translations","Unknown whether Wikipedia and Wikibooks are deduplicated against each other or separately"],"requires":["Project Gutenberg, Wikipedia, and Wikibooks datasets (integrated into Dolma, but original sources are separate artifacts)","Understanding of literary content and reference material quality","Familiarity with diverse writing styles and domains"],"input_types":["public domain book text (Project Gutenberg)","encyclopedia articles and metadata (Wikipedia)","educational textbook content (Wikibooks)"],"output_types":["filtered, deduplicated literary and reference corpus","content by domain and writing style (distribution unknown)","quality-filtered subset (high-quality articles only)","structured metadata (article titles, categories, links)"],"categories":["data-processing-analysis","text-generation-language"],"confidence":0.5,"matches":0,"success_rate":0},{"id":"dolma__cap_8","uri":"capability://data.processing.analysis.dataset.reproducibility.and.version.control.through.documented.curation.specifications","name":"dataset reproducibility and version control through documented curation specifications","description":"Dolma provides comprehensive documentation of all data curation decisions, including exact filtering rules, deduplication methods, source mixing ratios, and training phase specifications. This documentation enables researchers to reproduce the dataset independently or modify specific curation steps without rebuilding from scratch. The specification-driven approach treats data curation as a reproducible process rather than a black box, allowing other teams to validate, audit, or extend the dataset. Documentation is released alongside trained models (OLMo family) to enable validation of training reproducibility.","intents":["I want to reproduce the exact training data used in OLMo models to validate model behavior or train similar models","I need to understand and potentially modify specific data curation steps without rebuilding the entire dataset","I want to audit data curation decisions to identify potential biases or gaps","I need to publish research with fully reproducible training data specifications"],"best_for":["Researchers conducting reproducible LLM training experiments","Teams validating model training claims or reproducing published results","Open-source projects requiring transparent, auditable data curation","Academic institutions publishing research with reproducible training data"],"limitations":["Documentation completeness is unknown — specific filtering rules, deduplication parameters, and mixing ratios are claimed but not shown in available materials","No formal specification language or schema for curation rules — documentation may be prose-based and difficult to parse programmatically","No version control or change tracking for curation specifications — unclear how updates or corrections are managed","Reproducibility requires access to all 7 source datasets and preprocessing tools (Datamap-rs, Duplodocus) — full reproduction may be infeasible for some teams","No automated validation or testing of curation specifications — unclear whether documented rules match actual implementation","Documentation may become outdated as tools and sources evolve"],"requires":["Access to all 7 source datasets (Common Crawl, The Stack, peS2o, Project Gutenberg, Wikipedia, Wikibooks, C4)","Preprocessing tools: Datamap-rs (data cleaning) and Duplodocus (deduplication)","Understanding of data curation concepts and filtering rules","Computational resources to run curation pipelines on multi-trillion-token dataset","Access to Dolma documentation (format and completeness unknown)"],"input_types":["curation specifications (filtering rules, deduplication parameters, mixing ratios)","source datasets (raw or pre-processed)","training phase definitions"],"output_types":["reproduced dataset (identical to original Dolma)","modified dataset (with custom curation rules)","curation audit report (validation of specifications)","reproducibility metrics (comparison to original dataset)"],"categories":["data-processing-analysis","automation-workflow"],"confidence":0.5,"matches":0,"success_rate":0},{"id":"dolma__cap_9","uri":"capability://automation.workflow.integration.with.olmocore.training.framework.for.end.to.end.model.training","name":"integration with olmocore training framework for end-to-end model training","description":"Dolma is designed as a native data source for OlmoCore, Allen AI's open-source training framework, enabling seamless integration from data loading through model checkpointing. The integration includes optimized data loading pipelines, distributed training support, and checkpoint management that work directly with Dolma's data format and structure. OlmoCore handles tokenization, batching, and training orchestration while consuming Dolma data, eliminating the need for custom data pipeline engineering. The integration enables researchers to train models using Dolma without building custom infrastructure.","intents":["I want to train a language model using Dolma data without building custom data loading and training infrastructure","I need to use a proven, open-source training framework that is optimized for Dolma's data structure","I want to reproduce OLMo model training or train similar models using the same framework and data","I need distributed training support and checkpoint management integrated with my data pipeline"],"best_for":["Teams training language models using Dolma data with OlmoCore framework","Researchers reproducing or extending OLMo model training","Open-source projects requiring integrated data and training infrastructure","Developers building custom models on top of OlmoCore"],"limitations":["Integration is specific to OlmoCore — using Dolma with other training frameworks (PyTorch Lightning, Hugging Face Transformers, etc.) requires custom data loading code","OlmoCore requirements and capabilities are unknown from available materials","No documentation of data loading performance, throughput, or scalability","Unknown whether OlmoCore supports all Dolma features (staged training, source-specific filtering, etc.) or only basic data loading","Requires learning OlmoCore API and configuration — not compatible with existing training pipelines using other frameworks","Unknown whether OlmoCore is actively maintained and supported"],"requires":["OlmoCore training framework (separate artifact, requirements unknown)","Understanding of OlmoCore API and configuration","Computational resources for distributed training (GPUs/TPUs)","Familiarity with model training concepts (learning rates, batch sizes, checkpointing, etc.)"],"input_types":["Dolma dataset (pretraining, mid-training, or post-training pool)","model configuration (architecture, hyperparameters)","training configuration (batch size, learning rate, number of steps, etc.)"],"output_types":["trained model checkpoint","training logs and metrics","evaluation results","model artifacts (weights, config, tokenizer)"],"categories":["automation-workflow","tool-use-integration"],"confidence":0.5,"matches":0,"success_rate":0},{"id":"dolma__headline","uri":"capability://model.training.large.scale.language.model.training.dataset","name":"large-scale language model training dataset","description":"Dolma is an extensive open dataset containing 3 trillion tokens, specifically curated for training the OLMo family of language models, making it ideal for researchers and developers in AI and NLP.","intents":["best dataset for training language models","open datasets for NLP research","large-scale datasets for AI model training","datasets for OLMo model training","best resources for language model development"],"best_for":["NLP researchers","AI developers"],"limitations":["requires significant computational resources"],"requires":["familiarity with machine learning"],"input_types":[],"output_types":[],"categories":["model-training"],"confidence":0.5,"matches":0,"success_rate":0}],"trust":{"score":58,"verified":false,"data_access_risk":"high","permissions":["Access to allenai.org/dolma download infrastructure (protocol and authentication method unspecified)","OlmoCore training framework (separate artifact) to consume and apply dataset","Sufficient storage capacity for 3 trillion tokens (exact GB/TB requirement not documented)","Understanding of data curation concepts: deduplication, filtering, mixing ratios, and training phases","Familiarity with large-scale model training pipelines and distributed training infrastructure","Datamap-rs tool for large-scale data cleaning (separate artifact, requirements unknown)","Duplodocus tool for fuzzy deduplication (separate artifact, requirements unknown)","Understanding of source-specific quality criteria (language detection, code syntax validation, venue ranking, etc.)","Computational resources to run filtering pipelines on multi-trillion-token dataset","Open Instruct framework (separate artifact, requirements unknown)"],"failure_modes":["Dataset is a static snapshot with no versioning or update mechanism described — cannot incorporate new data sources or refresh stale web content","Fixed to 7 predefined sources with no documented mechanism for adding custom data sources or adjusting mixing ratios dynamically","Requires external training infrastructure (OlmoCore) and post-training pipeline (Open Instruct) — Dolma alone is not a complete training solution","No quantitative quality metrics or benchmark comparisons provided in documentation — quality assessment is implicit in source selection rather than explicit","Storage and bandwidth requirements unknown — no guidance on disk space, download time, or network costs for accessing full dataset","Licensing terms and commercial usage restrictions not specified in available documentation","Filtering rules are documented but not parameterized — no mechanism to adjust thresholds or disable specific filters without rebuilding the dataset","Deduplication method (Duplodocus) is a separate tool with unknown computational cost and memory requirements","No quantitative metrics on filtering impact (e.g., percentage of data removed per source, quality score distributions)","Filtering rules are source-specific and may not generalize to custom data sources added by users","builder identity is not verified yet","no observed match outcomes yet"],"rank_breakdown":{"adoption":0.7,"quality":0.9,"ecosystem":0.3,"match_graph":0.25,"freshness":0.75,"weights":{"adoption":0.3,"quality":0.25,"ecosystem":0.1,"match_graph":0.3,"freshness":0.05}},"observed_outcomes":{"matches":0,"success_rate":0,"avg_confidence":0,"top_intents":[],"last_matched_at":null},"maintenance":{"status":"active","updated_at":"2026-05-24T12:16:21.548Z","last_scraped_at":null,"last_commit":null},"community":{"stars":null,"forks":null,"weekly_downloads":null,"model_downloads":null,"model_likes":null}},"distribution":{"claim_url":"https://unfragile.ai/submit?claim=dolma","compare_url":"https://unfragile.ai/compare?artifact=dolma"}},"signature":"QcYEtYoZu4IAE9tf+VPSbPpKLEz74/vg7CZKnMr2EFh098dShdbriKL6HY3RAQR4pWIac+jAgAgXaRZHPCwKDg==","signedAt":"2026-06-21T02:04:48.975Z","signedBy":"unfragile.ai","version":1},"_links":{"self":"https://unfragile.ai/api/v1/passport/dolma","artifact":"https://unfragile.ai/dolma","verify":"https://unfragile.ai/api/v1/verify?slug=dolma","publicKey":"https://unfragile.ai/api/v1/trust-passport-public-key","spec":"https://unfragile.ai/trust","schema":"https://unfragile.ai/schema.json","docs":"https://unfragile.ai/docs"}}