{"passport":{"unfragile":{"@version":"1.0","version":"2026-05","artifact":{"id":"starcoderdata","slug":"starcoderdata","name":"StarCoderData","type":"dataset","url":"https://huggingface.co/datasets/bigcode/starcoderdata","page_url":"https://unfragile.ai/starcoderdata","categories":["model-training","testing-quality"],"tags":[],"pricing":{"model":"free","free":true,"starting_price":null},"status":"active","verified":false},"capabilities":[{"id":"starcoderdata__cap_0","uri":"capability://data.processing.analysis.multi.language.code.dataset.curation.with.near.deduplication","name":"multi-language code dataset curation with near-deduplication","description":"Processes raw code from The Stack (a 3TB+ dataset) through a multi-stage filtering pipeline that applies near-deduplication heuristics (likely MinHash or similar probabilistic techniques) to identify and remove near-identical code blocks across 86 programming languages. The curation preserves language-specific semantics while reducing redundancy, enabling models trained on this data to learn diverse coding patterns rather than memorizing repetitive boilerplate. Outputs a deduplicated 250GB subset suitable for model pretraining.","intents":["I need a large, deduplicated code dataset to pretrain a code LLM without wasting compute on redundant examples","I want to ensure my model learns diverse coding patterns across 86 languages, not just memorize common boilerplate","I need to reduce dataset size while maintaining quality and language coverage for efficient training"],"best_for":["ML researchers training code foundation models from scratch","organizations building domain-specific code LLMs with limited compute budgets","teams needing a baseline dataset for transfer learning or fine-tuning"],"limitations":["Near-deduplication is probabilistic — some similar code may remain; exact deduplication would require O(n²) comparisons","250GB is still large; requires significant storage and bandwidth for download/processing","Language distribution may be imbalanced (e.g., Python/JavaScript likely overrepresented vs niche languages)","Deduplication thresholds are fixed — no tuning for domain-specific redundancy tolerance"],"requires":["250GB+ disk space for full dataset","Hugging Face account with dataset access permissions","Network bandwidth for downloading 250GB (or access to cached mirrors)","Python 3.7+ with datasets library for programmatic access"],"input_types":["raw code files from The Stack (multiple formats: .py, .js, .java, .rs, etc.)","GitHub metadata (issue descriptions, commit messages)"],"output_types":["deduplicated code samples (text/code format)","structured dataset splits (train/validation/test)","language-tagged code examples with metadata"],"categories":["data-processing-analysis","model-training"],"confidence":0.5,"matches":0,"success_rate":0},{"id":"starcoderdata__cap_1","uri":"capability://safety.moderation.pii.removal.and.privacy.preserving.code.filtering","name":"pii removal and privacy-preserving code filtering","description":"Applies automated PII (Personally Identifiable Information) detection and removal across the dataset, scanning for patterns like email addresses, API keys, credentials, and personal names embedded in code comments or strings. Uses regex-based and potentially ML-based classifiers to identify sensitive data, then either redacts or removes affected code samples. This ensures the resulting dataset is safe for public distribution and model training without leaking private information.","intents":["I need to ensure my training dataset doesn't contain leaked credentials, API keys, or personal information","I want to publicly release a code dataset without legal/privacy liability from PII exposure","I need to filter out code samples that contain email addresses, phone numbers, or personal names in comments"],"best_for":["organizations publishing open-source datasets and models","teams training models for commercial use where data provenance matters","researchers ensuring GDPR/privacy compliance in ML pipelines"],"limitations":["PII detection is not perfect — some obfuscated or domain-specific sensitive data may slip through","Overly aggressive filtering may remove legitimate code (e.g., example email addresses in documentation)","Regex-based detection doesn't understand context — may flag false positives in variable names or test data","No transparency into which specific PII patterns are detected or how redaction is applied"],"requires":["PII detection library or custom regex patterns (implementation details not publicly documented)","Ability to parse and modify code without breaking syntax","Computational overhead for scanning 250GB of code (likely batched/parallelized)"],"input_types":["raw code files with embedded comments, strings, and metadata"],"output_types":["code samples with PII redacted or removed","metadata flags indicating which samples were filtered"],"categories":["safety-moderation","data-processing-analysis"],"confidence":0.5,"matches":0,"success_rate":0},{"id":"starcoderdata__cap_2","uri":"capability://data.processing.analysis.quality.filtering.and.code.validity.assessment","name":"quality filtering and code validity assessment","description":"Implements heuristic-based quality filtering to exclude low-quality, malformed, or non-functional code samples from the dataset. Likely uses metrics such as: file size thresholds (excluding very small or very large files), syntax validity checks (parsing code to ensure it's well-formed), license filtering (excluding code with restrictive licenses), and potentially code complexity or style metrics. Filters are applied per-language to respect language-specific conventions (e.g., Python indentation rules vs. JavaScript semicolons).","intents":["I want to train on high-quality, syntactically valid code rather than snippets or malformed examples","I need to exclude code with restrictive licenses or unclear provenance","I want to filter out generated or auto-formatted code that doesn't represent real coding patterns"],"best_for":["teams training code models where output quality directly impacts downstream applications","organizations concerned with license compliance and legal provenance","researchers studying real-world coding patterns (not synthetic or auto-generated code)"],"limitations":["Quality metrics are heuristic-based and may not correlate with actual code usefulness for model training","Syntax validity checks require language-specific parsers — some languages may be under-represented if parsing is incomplete","License filtering may be overly conservative (e.g., excluding MIT-licensed code if metadata is ambiguous)","No human review — automated filtering can miss context-dependent quality issues"],"requires":["Language-specific parsers or linters for 86 languages (tree-sitter, language-specific AST tools)","License detection library (e.g., licensename, SPDX metadata parsing)","Configurable thresholds for file size, complexity, and other metrics"],"input_types":["raw code files with metadata (file size, language, license info)"],"output_types":["filtered code samples marked as 'high-quality'","exclusion metadata (reason for filtering: license, syntax error, size, etc.)"],"categories":["data-processing-analysis","safety-moderation"],"confidence":0.5,"matches":0,"success_rate":0},{"id":"starcoderdata__cap_3","uri":"capability://data.processing.analysis.multi.language.code.representation.and.tokenization","name":"multi-language code representation and tokenization","description":"Provides code samples across 86 programming languages with language-aware metadata and tokenization support. Each sample is tagged with its language, enabling downstream models to learn language-specific patterns and syntax. The dataset structure supports efficient loading and batching of code by language, allowing models to train on language-balanced or language-specific subsets. Tokenization is deferred to the model training pipeline, but the dataset preserves raw code to enable flexible tokenizer choices.","intents":["I need a dataset with code from 86 languages to train a polyglot code model","I want to sample code by language for balanced training or language-specific fine-tuning","I need to preserve raw code (not pre-tokenized) so I can use my own tokenizer"],"best_for":["researchers training multilingual code models (like StarCoder)","teams building language-specific code tools that need diverse training data","organizations experimenting with different tokenization strategies"],"limitations":["Language distribution is likely imbalanced (Python/JavaScript overrepresented, niche languages underrepresented)","No built-in tokenization — requires downstream tokenizer configuration","Language detection/tagging may be imperfect for polyglot files or embedded code","No language-specific preprocessing (e.g., removing language-specific comments or docstrings)"],"requires":["Language detection logic (likely based on file extension or MIME type, not content analysis)","Support for 86 language parsers/syntax definitions in downstream training pipeline","Tokenizer compatible with code (e.g., GPT-2 tokenizer, custom code tokenizer)"],"input_types":["code files in 86 programming languages","language metadata (file extension, explicit language tag)"],"output_types":["code samples with language tags","language-stratified dataset splits","raw code text (not pre-tokenized)"],"categories":["data-processing-analysis","code-generation-editing"],"confidence":0.5,"matches":0,"success_rate":0},{"id":"starcoderdata__cap_4","uri":"capability://data.processing.analysis.github.context.integration.issues.commits.and.code.relationships","name":"github context integration (issues, commits, and code relationships)","description":"Augments raw code samples with GitHub metadata including issue descriptions, commit messages, and code change history. This provides semantic context for code snippets, enabling models to learn the relationship between code changes and their motivations/descriptions. The dataset likely includes paired examples of (code, issue description) or (code change, commit message), enriching the training signal beyond syntax-only learning. Enables training on code-to-text and text-to-code tasks simultaneously.","intents":["I want to train a model that understands the relationship between code and natural language descriptions (issues, commits)","I need paired code-and-description examples for code summarization or code search tasks","I want to improve code generation by conditioning on issue descriptions or commit messages"],"best_for":["researchers training code-to-text or text-to-code models","teams building code search or code recommendation systems","organizations developing code summarization or documentation generation tools"],"limitations":["GitHub metadata quality varies — issue descriptions may be vague, incomplete, or in multiple languages","Commit messages are often low-quality or non-descriptive ('fix bug', 'update')","Pairing code with issues/commits requires heuristics (e.g., matching file changes to commit messages) — may introduce noise","No guarantee that issue descriptions accurately reflect the code changes","Privacy concerns: GitHub issues may contain sensitive information even after PII filtering"],"requires":["GitHub API access or pre-downloaded GitHub data (The Stack includes this)","Heuristics for matching code changes to issues/commits","Natural language processing to extract relevant issue/commit text"],"input_types":["code files with associated GitHub metadata","issue descriptions (text)","commit messages (text)","code diffs (showing changes)"],"output_types":["paired (code, issue description) examples","paired (code change, commit message) examples","code with contextual metadata"],"categories":["data-processing-analysis","text-generation-language"],"confidence":0.5,"matches":0,"success_rate":0},{"id":"starcoderdata__cap_5","uri":"capability://data.processing.analysis.dataset.versioning.and.reproducible.splits","name":"dataset versioning and reproducible splits","description":"Provides versioned snapshots of the curated dataset with reproducible train/validation/test splits, enabling researchers to compare results across experiments and publications. Uses deterministic splitting logic (likely based on file hashes or fixed random seeds) to ensure the same code samples appear in the same splits across different downloads. Metadata includes dataset version, curation date, and filtering parameters, enabling reproducibility and ablation studies.","intents":["I need reproducible dataset splits so my results are comparable across experiments and publications","I want to know exactly which code samples are in my training set and how they were selected","I need to ablate different filtering steps to understand their impact on model performance"],"best_for":["researchers publishing papers with code models (reproducibility requirement)","teams conducting ablation studies on dataset curation steps","organizations benchmarking different code models on the same data"],"limitations":["Versioning adds complexity — old versions may become unavailable or deprecated","Reproducible splits require fixed random seeds — can't easily add new data without breaking splits","No built-in support for cross-validation or stratified sampling by language","Dataset size (250GB) makes it impractical to re-download for minor version updates"],"requires":["Hugging Face Datasets library with versioning support","Deterministic splitting logic (fixed seeds, hash-based sampling)","Metadata tracking (version, curation date, filtering parameters)"],"input_types":["curated code dataset"],"output_types":["versioned dataset snapshots","reproducible train/validation/test splits","metadata (version, curation parameters, split statistics)"],"categories":["data-processing-analysis","automation-workflow"],"confidence":0.5,"matches":0,"success_rate":0},{"id":"starcoderdata__cap_6","uri":"capability://data.processing.analysis.efficient.dataset.streaming.and.lazy.loading","name":"efficient dataset streaming and lazy loading","description":"Implements streaming-based data loading via Hugging Face Datasets library, enabling researchers to train on the full 250GB dataset without downloading it entirely upfront. Uses lazy loading and on-the-fly batching to load code samples into memory as needed, reducing storage requirements and enabling training on machines with limited disk space. Supports efficient sampling, shuffling, and filtering operations without materializing the full dataset.","intents":["I want to train on the full 250GB dataset but my machine only has 100GB of disk space","I need to sample and shuffle code examples efficiently without loading the entire dataset into memory","I want to filter the dataset dynamically (e.g., by language) during training without pre-processing"],"best_for":["researchers with limited disk/memory resources","teams training on cloud infrastructure with per-sample billing","organizations experimenting with different dataset subsets without full downloads"],"limitations":["Streaming adds network latency — slower than local disk access, especially for random sampling","Requires stable internet connection — interruptions can break training runs","Caching behavior is opaque — unclear which samples are cached locally vs. fetched remotely","Shuffling across the full dataset requires multiple passes, increasing latency","No built-in support for distributed streaming across multiple machines"],"requires":["Hugging Face Datasets library (Python 3.7+)","Stable internet connection with sufficient bandwidth","Hugging Face account for dataset access","Training framework integration (PyTorch, TensorFlow, etc.)"],"input_types":["remote dataset (hosted on Hugging Face Hub)"],"output_types":["batched code samples (lazy-loaded into memory)","shuffled/filtered subsets"],"categories":["data-processing-analysis","automation-workflow"],"confidence":0.5,"matches":0,"success_rate":0},{"id":"starcoderdata__cap_7","uri":"capability://data.processing.analysis.language.specific.code.filtering.and.sampling","name":"language-specific code filtering and sampling","description":"Enables fine-grained control over dataset composition by language, allowing researchers to sample code by language distribution, exclude specific languages, or oversample underrepresented languages. Provides language-stratified sampling to ensure balanced training across languages or language-specific fine-tuning. Metadata includes language distribution statistics, enabling informed decisions about dataset composition.","intents":["I want to train a model on balanced code from all 86 languages, not just Python and JavaScript","I need to oversample code from niche languages to improve model performance on those languages","I want to train a language-specific code model using only Python code from the dataset"],"best_for":["researchers training polyglot code models with language-balanced data","teams building language-specific code tools (e.g., Rust-specific code generation)","organizations studying how language diversity affects code model performance"],"limitations":["Language distribution is fixed — can't dynamically rebalance without re-curation","Oversampling niche languages may introduce data leakage or reduce diversity","Language detection may be imperfect for polyglot files or embedded code","No built-in support for language-specific preprocessing (e.g., removing language-specific comments)"],"requires":["Language metadata for each code sample (file extension, explicit tag)","Language distribution statistics (available in dataset documentation)","Sampling logic in training pipeline (e.g., stratified sampling, oversampling)"],"input_types":["code samples with language tags"],"output_types":["language-stratified subsets","language distribution statistics","language-specific code samples"],"categories":["data-processing-analysis","code-generation-editing"],"confidence":0.5,"matches":0,"success_rate":0},{"id":"starcoderdata__headline","uri":"capability://data.processing.analysis.curated.code.dataset.for.training.ai.models","name":"curated code dataset for training ai models","description":"A comprehensive 250GB dataset specifically curated for training AI models, featuring high-quality code from 86 programming languages, GitHub issues, and commits, ensuring minimal duplication and privacy compliance.","intents":["best code dataset for AI training","curated dataset for machine learning models","high-quality code dataset for developers","code dataset for multi-language support","AI training data for programming languages"],"best_for":["AI model training","research in programming languages"],"limitations":["may not cover niche programming languages"],"requires":[],"input_types":[],"output_types":[],"categories":["data-processing-analysis"],"confidence":0.5,"matches":0,"success_rate":0}],"trust":{"score":57,"verified":false,"data_access_risk":"high","permissions":["250GB+ disk space for full dataset","Hugging Face account with dataset access permissions","Network bandwidth for downloading 250GB (or access to cached mirrors)","Python 3.7+ with datasets library for programmatic access","PII detection library or custom regex patterns (implementation details not publicly documented)","Ability to parse and modify code without breaking syntax","Computational overhead for scanning 250GB of code (likely batched/parallelized)","Language-specific parsers or linters for 86 languages (tree-sitter, language-specific AST tools)","License detection library (e.g., licensename, SPDX metadata parsing)","Configurable thresholds for file size, complexity, and other metrics"],"failure_modes":["Near-deduplication is probabilistic — some similar code may remain; exact deduplication would require O(n²) comparisons","250GB is still large; requires significant storage and bandwidth for download/processing","Language distribution may be imbalanced (e.g., Python/JavaScript likely overrepresented vs niche languages)","Deduplication thresholds are fixed — no tuning for domain-specific redundancy tolerance","PII detection is not perfect — some obfuscated or domain-specific sensitive data may slip through","Overly aggressive filtering may remove legitimate code (e.g., example email addresses in documentation)","Regex-based detection doesn't understand context — may flag false positives in variable names or test data","No transparency into which specific PII patterns are detected or how redaction is applied","Quality metrics are heuristic-based and may not correlate with actual code usefulness for model training","Syntax validity checks require language-specific parsers — some languages may be under-represented if parsing is incomplete","builder identity is not verified yet","no observed match outcomes yet"],"rank_breakdown":{"adoption":0.7,"quality":0.8500000000000001,"ecosystem":0.39999999999999997,"match_graph":0.25,"freshness":0.75,"weights":{"adoption":0.3,"quality":0.25,"ecosystem":0.1,"match_graph":0.3,"freshness":0.05}},"observed_outcomes":{"matches":0,"success_rate":0,"avg_confidence":0,"top_intents":[],"last_matched_at":null},"maintenance":{"status":"active","updated_at":"2026-05-24T12:16:28.695Z","last_scraped_at":null,"last_commit":null},"community":{"stars":null,"forks":null,"weekly_downloads":null,"model_downloads":null,"model_likes":null}},"distribution":{"claim_url":"https://unfragile.ai/submit?claim=starcoderdata","compare_url":"https://unfragile.ai/compare?artifact=starcoderdata"}},"signature":"CifB2DnuOV5VerqFT0W89uvh+pIqhUpFLt0HNWC7WAT33fS5yYMLW+kYmA0RrHfOIq/oc8C1SLWtb1i4+71pDA==","signedAt":"2026-06-19T20:41:28.848Z","signedBy":"unfragile.ai","version":1},"_links":{"self":"https://unfragile.ai/api/v1/passport/starcoderdata","artifact":"https://unfragile.ai/starcoderdata","verify":"https://unfragile.ai/api/v1/verify?slug=starcoderdata","publicKey":"https://unfragile.ai/api/v1/trust-passport-public-key","spec":"https://unfragile.ai/trust","schema":"https://unfragile.ai/schema.json","docs":"https://unfragile.ai/docs"}}