{"passport":{"unfragile":{"@version":"1.0","version":"2026-05","artifact":{"id":"starcoder-data","slug":"starcoder-data","name":"StarCoder Data","type":"dataset","url":"https://huggingface.co/datasets/bigcode/starcoderdata","page_url":"https://unfragile.ai/starcoder-data","categories":["model-training"],"tags":[],"pricing":{"model":"free","free":true,"starting_price":null},"status":"active","verified":false},"capabilities":[{"id":"starcoder-data__cap_0","uri":"capability://data.processing.analysis.multi.language.code.corpus.assembly.with.permissive.licensing.verification","name":"multi-language code corpus assembly with permissive licensing verification","description":"Aggregates 783 GB of source code across 86 programming languages from publicly available repositories, filtering exclusively for permissively licensed code (MIT, Apache 2.0, BSD, etc.) to ensure legal trainability. Uses license detection via SPDX identifiers and repository metadata scanning to validate licensing status at collection time, preventing inclusion of GPL or proprietary code that would create legal friction for downstream model training.","intents":["I need a large, legally-safe code dataset to train or fine-tune a code LLM without licensing disputes","I want to understand what programming languages and code patterns are represented in modern open-source ecosystems","I need to audit which repositories and licenses contributed to a training corpus for compliance documentation"],"best_for":["ML teams training code models at scale","organizations building proprietary code LLMs with legal/compliance requirements","researchers studying code distribution across programming languages"],"limitations":["Permissive-only filtering excludes GPL and AGPL code, reducing diversity in certain domains (Linux kernel, GNU tools)","License detection relies on repository metadata which may be incomplete or incorrect for ~2-5% of sources","No dynamic license updates — dataset is a snapshot; licensing changes post-collection are not reflected"],"requires":["Sufficient storage for 783 GB uncompressed (or ~200 GB compressed)","Hugging Face account for dataset access","Network bandwidth for download (multi-hour transfer typical)"],"input_types":["GitHub repository URLs","SPDX license identifiers","Repository metadata (license files, package manifests)"],"output_types":["raw source code files","structured dataset with metadata (language, license, repository origin)","parquet/arrow format for efficient streaming"],"categories":["data-processing-analysis","code-training-datasets"],"confidence":0.5,"matches":0,"success_rate":0},{"id":"starcoder-data__cap_1","uri":"capability://data.processing.analysis.near.deduplication.and.exact.deduplication.with.semantic.similarity.detection","name":"near-deduplication and exact deduplication with semantic similarity detection","description":"Applies two-stage deduplication: exact string matching to remove byte-for-byte duplicates, followed by near-deduplication using MinHash/Jaccard similarity (typically threshold ~0.85) to identify and remove near-identical code blocks that differ only in whitespace, comments, or minor variable renames. This reduces redundancy while preserving legitimate code diversity, preventing the model from overweighting common boilerplate or copy-pasted snippets.","intents":["I want to remove redundant training examples so the model learns diverse patterns instead of memorizing duplicates","I need to understand how much of the dataset is truly unique code vs. repeated boilerplate","I want to ensure my training corpus doesn't waste compute on learning the same function 10,000 times"],"best_for":["teams optimizing training efficiency and model generalization","researchers studying code diversity and reuse patterns","organizations with limited compute budgets needing high-quality training data"],"limitations":["Near-deduplication threshold (0.85) is a heuristic — may remove legitimately similar but distinct implementations","Deduplication is one-directional; cannot reconstruct which original files were merged","Computationally expensive for 783 GB — requires distributed processing; single-machine dedup would take weeks"],"requires":["Distributed compute cluster or cloud infrastructure (Spark, Ray, or similar)","MinHash/Jaccard similarity library (e.g., datasketch, minhash-rs)","Sufficient RAM for similarity index (~50-100 GB for full dataset)"],"input_types":["raw source code files","code snippets with variable length (10 lines to 10,000 lines)"],"output_types":["deduplicated code corpus","deduplication report (% removed, similarity distribution)","mapping of removed duplicates to canonical versions"],"categories":["data-processing-analysis","code-training-datasets"],"confidence":0.5,"matches":0,"success_rate":0},{"id":"starcoder-data__cap_2","uri":"capability://safety.moderation.personally.identifiable.information.redaction.with.multi.pattern.detection","name":"personally identifiable information redaction with multi-pattern detection","description":"Scans the entire 783 GB corpus for PII patterns including email addresses, IP addresses (IPv4/IPv6), API keys, private keys, and other sensitive credentials using regex-based pattern matching and entropy-based detection. Redacts or removes identified PII before dataset release, protecting developer privacy and preventing accidental exposure of secrets in the training data that could be memorized and leaked by the model.","intents":["I need to ensure the training dataset doesn't contain leaked API keys, passwords, or private credentials","I want to protect developer privacy by removing email addresses and personal identifiers from code","I need to audit what PII was detected and redacted for compliance and transparency reporting"],"best_for":["organizations with privacy/compliance requirements (GDPR, CCPA, SOC 2)","teams concerned about model memorization of secrets","projects requiring transparency in data provenance and cleaning"],"limitations":["Pattern-based detection has false negatives — obfuscated or unusual credential formats may be missed","Entropy-based detection can produce false positives (random-looking variable names flagged as keys)","No context-aware redaction — cannot distinguish between a real API key and a placeholder string","Redaction is lossy; cannot reconstruct original PII, limiting post-hoc verification"],"requires":["Regex engine supporting lookahead/lookbehind (Python re or similar)","Entropy calculation library (e.g., Shannon entropy)","Distributed scanning infrastructure for 783 GB (Spark, MapReduce, or similar)"],"input_types":["raw source code files","configuration files","comments and docstrings"],"output_types":["redacted source code","PII detection report (count by type, redaction rate)","audit log of redacted patterns"],"categories":["safety-moderation","data-processing-analysis"],"confidence":0.5,"matches":0,"success_rate":0},{"id":"starcoder-data__cap_3","uri":"capability://data.processing.analysis.jupyter.notebook.code.text.interleaving.preservation","name":"jupyter notebook code-text interleaving preservation","description":"Extracts and preserves code cells and markdown text from Jupyter notebooks as interleaved sequences, maintaining the pedagogical structure where explanatory text precedes or follows code blocks. This allows models trained on the dataset to learn the relationship between natural language documentation and code implementation, improving code generation quality when models can reference explanatory context.","intents":["I want to train a model that understands code in the context of explanatory text and documentation","I need to preserve the learning structure of notebooks so the model learns how humans explain code","I want to improve code generation by providing models with examples of well-documented code patterns"],"best_for":["teams training code-generation models that need to produce documented code","educational AI systems that should explain code as they generate it","researchers studying code-documentation relationships"],"limitations":["Notebook extraction is format-specific (.ipynb JSON) — requires custom parsing for other notebook formats","Interleaving structure is lost if notebooks are flattened to pure code or pure text","Notebook execution state (variables, outputs) is not preserved — only source code and markdown","Quality varies widely; many notebooks contain incomplete, broken, or pedagogically poor code"],"requires":["Jupyter notebook parser (nbformat library or equivalent)","Ability to handle JSON parsing at scale","Storage for both code and markdown in structured format"],"input_types":[".ipynb files","notebook cells (code and markdown)","cell execution order metadata"],"output_types":["interleaved code-text sequences","structured format preserving cell boundaries and types","flattened code-only or text-only variants"],"categories":["data-processing-analysis","code-training-datasets"],"confidence":0.5,"matches":0,"success_rate":0},{"id":"starcoder-data__cap_4","uri":"capability://safety.moderation.developer.opt.out.mechanism.with.repository.level.granularity","name":"developer opt-out mechanism with repository-level granularity","description":"Provides a mechanism for developers to request exclusion of their repositories from the dataset, respecting developer autonomy and addressing concerns about code being used for AI training without consent. Maintains an opt-out registry that is checked during dataset construction and updates, allowing developers to remove their code retroactively or prevent future inclusion.","intents":["I want to exclude my code from being used to train AI models without my consent","I need to understand which developers have opted out and why","I want to ensure my dataset respects developer preferences and ethical concerns"],"best_for":["organizations building datasets with ethical/consent considerations","projects addressing developer concerns about AI training data","teams needing to demonstrate respect for developer autonomy"],"limitations":["Opt-out is repository-level, not file-level — cannot exclude specific files within a repository","Opt-out mechanism is manual/administrative — requires developer to actively request exclusion","No retroactive removal from already-trained models — only affects future dataset versions","Opt-out registry is not publicly queryable in real-time; updates are batched"],"requires":["Administrative process for handling opt-out requests (email, web form, or API)","Opt-out registry database (simple list or structured store)","Integration into dataset construction pipeline to filter excluded repositories"],"input_types":["developer opt-out requests","repository identifiers (GitHub URLs, owner/repo pairs)","opt-out reason/metadata"],"output_types":["updated opt-out registry","filtered dataset excluding opted-out repositories","opt-out statistics and transparency report"],"categories":["safety-moderation","data-processing-analysis"],"confidence":0.5,"matches":0,"success_rate":0},{"id":"starcoder-data__cap_5","uri":"capability://data.processing.analysis.multi.language.code.representation.with.language.specific.tokenization","name":"multi-language code representation with language-specific tokenization","description":"Organizes and represents code across 86 programming languages, applying language-specific parsing and tokenization strategies to preserve syntactic structure. Enables downstream models to learn language-specific patterns (e.g., Python indentation, Rust ownership, JavaScript async/await) rather than treating all code as generic text, improving language-specific code generation quality.","intents":["I want to train a multilingual code model that understands language-specific syntax and idioms","I need to analyze code distribution across languages and identify underrepresented languages","I want to ensure my model learns language-specific best practices and patterns"],"best_for":["teams building polyglot code models","organizations needing language-specific code generation","researchers studying code patterns across programming languages"],"limitations":["Language detection is imperfect — mixed-language files (e.g., HTML with embedded JavaScript) may be misclassified","Language-specific tokenization requires parsers for each language — adds complexity and maintenance burden","Code distribution is imbalanced — Python and JavaScript dominate; rare languages (Cobol, Fortran) are underrepresented","Language-specific features (e.g., type annotations) may not be preserved uniformly across languages"],"requires":["Language detection library (e.g., Linguist, tree-sitter)","Language-specific parsers or tokenizers (tree-sitter for 86+ languages)","Metadata tagging for language identification"],"input_types":["source code files in 86 programming languages","file extensions and language metadata","code content for language detection"],"output_types":["language-tagged code corpus","language-specific token sequences","language distribution statistics","language-specific subsets for targeted training"],"categories":["data-processing-analysis","code-training-datasets"],"confidence":0.5,"matches":0,"success_rate":0},{"id":"starcoder-data__cap_6","uri":"capability://data.processing.analysis.github.issues.and.git.commit.message.inclusion.for.context.and.intent","name":"github issues and git commit message inclusion for context and intent","description":"Incorporates GitHub issues and Git commit messages alongside source code, providing natural language context about code changes, bug fixes, and feature requests. This allows models to learn the relationship between code changes and their motivations, improving code generation quality by training on examples where code is paired with explanatory intent.","intents":["I want to train a model that understands code in the context of issues and commit messages","I need to improve code generation by providing models with examples of code changes paired with explanatory text","I want to understand how code changes relate to issues and feature requests"],"best_for":["teams training code-generation models that need to understand code intent","organizations building AI systems for code review and change explanation","researchers studying the relationship between code changes and natural language descriptions"],"limitations":["Issue and commit message quality varies widely — many are vague, incomplete, or poorly written","Linking issues to code changes is heuristic-based (commit message parsing) — may miss or misattribute relationships","Issue/commit data is less structured than code — requires additional parsing and cleaning","Privacy concerns with issue discussions — may contain sensitive information beyond code"],"requires":["GitHub API access or repository metadata export","Git history parsing (GitPython or equivalent)","Natural language processing for issue-code linking"],"input_types":["GitHub issues (title, description, comments)","Git commit messages","code diffs and changes","issue-commit linking metadata"],"output_types":["code-issue-commit triples","structured dataset with code, intent, and change context","natural language descriptions of code changes"],"categories":["data-processing-analysis","code-training-datasets"],"confidence":0.5,"matches":0,"success_rate":0},{"id":"starcoder-data__cap_7","uri":"capability://data.processing.analysis.large.scale.distributed.dataset.processing.and.streaming","name":"large-scale distributed dataset processing and streaming","description":"Implements distributed processing pipeline for 783 GB of code using frameworks like Spark or Ray, enabling efficient deduplication, PII redaction, and language detection across multiple machines. Provides streaming/chunked access patterns (Hugging Face Datasets format) to allow downstream users to load and process the dataset without requiring full 783 GB in memory, using lazy evaluation and batch processing.","intents":["I want to train a model on 783 GB of code without loading it all into memory","I need to process the dataset efficiently across multiple GPUs/TPUs","I want to stream subsets of the dataset for iterative training and experimentation"],"best_for":["teams with distributed compute infrastructure (Spark, Ray, Kubernetes)","organizations training large models with memory constraints","researchers doing iterative experimentation on code datasets"],"limitations":["Streaming adds latency — not suitable for single-pass training on high-bandwidth GPUs","Distributed processing requires cluster setup and management — adds operational complexity","Chunking/batching may break code context — large functions split across batches lose semantic meaning","Network I/O can become bottleneck if dataset is accessed from remote storage (S3, GCS)"],"requires":["Hugging Face Datasets library (Python 3.8+)","Distributed compute framework (Spark 3.0+, Ray 1.0+, or equivalent)","Network bandwidth for streaming (100+ Mbps recommended)","Storage backend (local disk, S3, GCS, or Hugging Face Hub)"],"input_types":["raw 783 GB code corpus","processing configuration (batch size, chunk size, sampling rate)","filtering/selection criteria"],"output_types":["streamed code batches","parquet/arrow format for efficient I/O","language-specific subsets","sampled datasets for prototyping"],"categories":["data-processing-analysis","automation-workflow"],"confidence":0.5,"matches":0,"success_rate":0},{"id":"starcoder-data__cap_8","uri":"capability://automation.workflow.dataset.versioning.and.reproducibility.tracking","name":"dataset versioning and reproducibility tracking","description":"Maintains versioned snapshots of the dataset with full provenance tracking, including data processing pipeline parameters, deduplication thresholds, PII redaction patterns, and opt-out exclusions applied to each version. Enables reproducible model training by documenting exact dataset composition, enabling researchers to cite specific dataset versions and understand how dataset changes affect model behavior. Supports rollback to previous versions and comparison of dataset statistics across versions.","intents":["Enable reproducible model training by documenting exact dataset composition and processing parameters","Track how dataset changes (new data, deduplication, opt-outs) affect model behavior","Support scientific reproducibility and model auditing by providing dataset provenance"],"best_for":["Research teams publishing models and requiring reproducibility","Organizations auditing model training data for compliance and bias","Projects studying how dataset composition affects model behavior"],"limitations":["Versioning adds storage overhead (multiple snapshots of 783 GB corpus)","Tracking all processing parameters requires detailed logging; incomplete logs reduce reproducibility","Comparing dataset versions requires recomputing statistics, which is computationally expensive","No mechanism to detect if external code sources (GitHub) have changed after dataset creation","Version history can become unwieldy for long-lived datasets with frequent updates"],"requires":["Version control system (Git, DVC) for dataset snapshots","Metadata database tracking processing parameters and exclusions","Logging and audit trail infrastructure","Storage for multiple dataset versions (multi-TB)"],"input_types":["Dataset snapshots at different points in time","Processing pipeline parameters and configuration","Opt-out requests and exclusion lists"],"output_types":["Versioned dataset releases with version identifiers","Dataset cards documenting composition and processing parameters","Provenance logs and audit trails","Dataset comparison reports"],"categories":["automation-workflow"],"confidence":0.5,"matches":0,"success_rate":0},{"id":"starcoder-data__headline","uri":"capability://model.training.curated.code.training.dataset.for.ai.models","name":"curated code training dataset for ai models","description":"A comprehensive dataset of 783 GB of permissively licensed code from 86 programming languages, ideal for training AI models on code understanding and generation tasks.","intents":["best dataset for AI code training","curated code dataset for machine learning","training data for AI coding models","large code dataset for model training","AI training dataset with GitHub issues"],"best_for":["AI model training","code generation tasks"],"limitations":[],"requires":[],"input_types":[],"output_types":[],"categories":["model-training"],"confidence":0.5,"matches":0,"success_rate":0}],"trust":{"score":56,"verified":false,"data_access_risk":"high","permissions":["Sufficient storage for 783 GB uncompressed (or ~200 GB compressed)","Hugging Face account for dataset access","Network bandwidth for download (multi-hour transfer typical)","Distributed compute cluster or cloud infrastructure (Spark, Ray, or similar)","MinHash/Jaccard similarity library (e.g., datasketch, minhash-rs)","Sufficient RAM for similarity index (~50-100 GB for full dataset)","Regex engine supporting lookahead/lookbehind (Python re or similar)","Entropy calculation library (e.g., Shannon entropy)","Distributed scanning infrastructure for 783 GB (Spark, MapReduce, or similar)","Jupyter notebook parser (nbformat library or equivalent)"],"failure_modes":["Permissive-only filtering excludes GPL and AGPL code, reducing diversity in certain domains (Linux kernel, GNU tools)","License detection relies on repository metadata which may be incomplete or incorrect for ~2-5% of sources","No dynamic license updates — dataset is a snapshot; licensing changes post-collection are not reflected","Near-deduplication threshold (0.85) is a heuristic — may remove legitimately similar but distinct implementations","Deduplication is one-directional; cannot reconstruct which original files were merged","Computationally expensive for 783 GB — requires distributed processing; single-machine dedup would take weeks","Pattern-based detection has false negatives — obfuscated or unusual credential formats may be missed","Entropy-based detection can produce false positives (random-looking variable names flagged as keys)","No context-aware redaction — cannot distinguish between a real API key and a placeholder string","Redaction is lossy; cannot reconstruct original PII, limiting post-hoc verification","builder identity is not verified yet","no observed match outcomes yet"],"rank_breakdown":{"adoption":0.7,"quality":0.8500000000000001,"ecosystem":0.3,"match_graph":0.25,"freshness":0.75,"weights":{"adoption":0.3,"quality":0.25,"ecosystem":0.1,"match_graph":0.3,"freshness":0.05}},"observed_outcomes":{"matches":0,"success_rate":0,"avg_confidence":0,"top_intents":[],"last_matched_at":null},"maintenance":{"status":"active","updated_at":"2026-05-24T12:16:28.695Z","last_scraped_at":null,"last_commit":null},"community":{"stars":null,"forks":null,"weekly_downloads":null,"model_downloads":null,"model_likes":null}},"distribution":{"claim_url":"https://unfragile.ai/submit?claim=starcoder-data","compare_url":"https://unfragile.ai/compare?artifact=starcoder-data"}},"signature":"s50kVBnq+ZyzPI+xkVwZQDojRxq1bSrTHb332IZ6Tiv+ThG0blu6lJUxClh+XA6/djEz0ERcORasSQt/34peAw==","signedAt":"2026-06-20T08:03:09.237Z","signedBy":"unfragile.ai","version":1},"_links":{"self":"https://unfragile.ai/api/v1/passport/starcoder-data","artifact":"https://unfragile.ai/starcoder-data","verify":"https://unfragile.ai/api/v1/verify?slug=starcoder-data","publicKey":"https://unfragile.ai/api/v1/trust-passport-public-key","spec":"https://unfragile.ai/trust","schema":"https://unfragile.ai/schema.json","docs":"https://unfragile.ai/docs"}}