{"passport":{"unfragile":{"@version":"1.0","version":"2026-05","artifact":{"id":"the-stack-v2","slug":"the-stack-v2","name":"The Stack v2","type":"dataset","url":"https://huggingface.co/datasets/bigcode/the-stack-v2","page_url":"https://unfragile.ai/the-stack-v2","categories":["model-training"],"tags":[],"pricing":{"model":"free","free":true,"starting_price":null},"status":"active","verified":false},"capabilities":[{"id":"the-stack-v2__cap_0","uri":"capability://data.processing.analysis.permissively.licensed.source.code.dataset.curation.and.aggregation","name":"permissively-licensed source code dataset curation and aggregation","description":"Aggregates 67 TB of source code from the Software Heritage archive, filtering for permissively licensed repositories (MIT, Apache 2.0, BSD, etc.) across 600+ programming languages. Uses automated license detection and validation to ensure legal compliance for model training. Implements a rigorous deduplication pipeline at file and repository levels to eliminate redundant training data and reduce dataset bloat.","intents":["Train code generation models on legally compliant, diverse source code without licensing risk","Build a comprehensive multi-language code corpus that covers niche and mainstream languages equally","Ensure training data quality by removing duplicate code patterns that would skew model learning"],"best_for":["ML teams training large code LLMs (10B+ parameters)","Open-source model developers needing legally defensible training data","Researchers studying code generation across language families"],"limitations":["Permissive license filtering excludes GPL and AGPL code, limiting coverage of certain ecosystems (Linux kernel, GNU tools)","Deduplication is content-based, not semantic — similar algorithms in different styles may be retained as duplicates","License detection relies on heuristics and file headers; edge cases with dual-licensing or custom licenses may be misclassified","67 TB dataset requires significant storage infrastructure and bandwidth for download/processing"],"requires":["Hugging Face account for dataset access","Minimum 100 GB free disk space for partial dataset, 500+ GB for full dataset","Network bandwidth for multi-TB download (recommend enterprise connection)","Python 3.8+ with datasets library for programmatic access"],"input_types":["repository metadata from Software Heritage","source code files in 600+ languages","license declarations and SPDX identifiers"],"output_types":["deduplicated code files with language tags","repository-level metadata (owner, license, language distribution)","training-ready tokenized sequences for LLM fine-tuning"],"categories":["data-processing-analysis","code-generation-editing"],"confidence":0.5,"matches":0,"success_rate":0},{"id":"the-stack-v2__cap_1","uri":"capability://safety.moderation.opt.out.governance.and.repository.exclusion.management","name":"opt-out governance and repository exclusion management","description":"Implements a community-driven opt-out system where repository owners can request removal of their code from the dataset without legal takedown notices. Maintains a registry of excluded repositories and re-applies exclusions during dataset updates. Provides transparent governance documentation and a clear submission process for removal requests, balancing open access with creator rights.","intents":["Allow developers to exclude their proprietary or sensitive code from public training datasets","Maintain ethical data practices by respecting creator preferences without requiring legal action","Ensure dataset freshness while honoring historical opt-out requests across versions"],"best_for":["Open-source projects concerned about code reuse in commercial models","Individual developers wanting control over their code's use in AI training","Organizations building datasets with community trust as a core value"],"limitations":["Opt-out is reactive, not proactive — requires developers to actively request removal","No guarantee of removal from already-trained models using earlier dataset versions","Processing opt-out requests introduces latency; exclusions may not apply until next dataset version","Relies on repository ownership verification; spoofed removal requests possible without strong authentication"],"requires":["Repository ownership verification (GitHub account, email domain ownership, or similar)","Submission to BigCode project's opt-out registry (URL/form TBD)","Processing time of 1-4 weeks for removal to take effect in next dataset release"],"input_types":["repository URL or identifier","owner identity verification","optional reason for exclusion"],"output_types":["confirmation of exclusion request","updated dataset manifest excluding specified repositories","audit log of opt-out decisions"],"categories":["safety-moderation","automation-workflow"],"confidence":0.5,"matches":0,"success_rate":0},{"id":"the-stack-v2__cap_2","uri":"capability://safety.moderation.pii.and.sensitive.data.removal.pipeline","name":"pii and sensitive data removal pipeline","description":"Automated pipeline that scans source code for personally identifiable information (email addresses, API keys, SSH keys, credit card patterns, phone numbers) and removes or redacts them before dataset release. Uses regex patterns, entropy-based detection for secrets, and heuristic rules to identify sensitive data. Operates at file level with configurable sensitivity thresholds to balance data utility against privacy risk.","intents":["Remove accidentally committed secrets and credentials from training data to prevent credential leakage in generated code","Protect individual privacy by redacting email addresses and personal identifiers from code comments and strings","Reduce security risk of training models on code containing hardcoded API keys or database credentials"],"best_for":["Teams training models that will generate code in production environments","Privacy-conscious organizations handling code from diverse contributors","Security-focused projects where credential leakage in model outputs is a critical risk"],"limitations":["Regex and entropy-based detection have false positives (e.g., UUIDs flagged as secrets) and false negatives (obfuscated credentials)","Context-aware redaction may reduce code utility — removing variable names or function signatures can break code semantics","PII removal is lossy; original code cannot be recovered from redacted version","Detection rules are language-agnostic; language-specific secrets (e.g., Kubernetes manifests) may be missed","No detection of indirect PII (e.g., usernames in git commit history, though file content is scanned)"],"requires":["Regex engine supporting PCRE or similar for pattern matching","Entropy calculation library for secret detection (Shannon entropy threshold ~3.5 bits/char)","Processing time proportional to dataset size; 67 TB requires distributed scanning infrastructure"],"input_types":["raw source code files in any language","code comments and docstrings","configuration files (JSON, YAML, XML, etc.)"],"output_types":["redacted source code with PII replaced by placeholders","metadata log of redaction locations and types","confidence scores for each redaction decision"],"categories":["safety-moderation","data-processing-analysis"],"confidence":0.5,"matches":0,"success_rate":0},{"id":"the-stack-v2__cap_3","uri":"capability://data.processing.analysis.multi.language.source.code.indexing.and.retrieval","name":"multi-language source code indexing and retrieval","description":"Indexes 67 TB of source code across 600+ programming languages with language-aware metadata (syntax, file extension, language family). Enables retrieval by language, license, repository, or code patterns. Uses Software Heritage's existing indexing infrastructure as foundation, augmented with language detection and classification. Supports both bulk download and filtered queries for specific language subsets.","intents":["Train language-specific code models by filtering dataset to particular languages or language families","Analyze code distribution across languages to understand ecosystem representation in training data","Download language-specific subsets without processing entire 67 TB dataset"],"best_for":["Researchers training specialized models for niche languages (Rust, Go, Kotlin, etc.)","Teams wanting to balance training data across languages rather than defaulting to Python/JavaScript dominance","Organizations with limited storage who need specific language subsets"],"limitations":["Language detection relies on file extensions and heuristics; polyglot files or non-standard extensions may be misclassified","No semantic language understanding — can't distinguish between language variants (e.g., TypeScript vs JavaScript) without explicit markers","Filtering by language requires downloading metadata (~100 GB) even if only querying; no lightweight query API","Language distribution is skewed toward popular languages (Python, JavaScript, Java dominate); niche languages have limited coverage"],"requires":["Hugging Face datasets library or direct S3 access to BigCode bucket","Language detection library (e.g., Linguist, Pygments) for local filtering","Metadata index (~100 GB) for efficient language-based queries"],"input_types":["language identifier (ISO 639-1 code or language name)","file extension patterns","repository metadata filters"],"output_types":["filtered code files matching language criteria","language distribution statistics","per-language dataset splits with size and file counts"],"categories":["data-processing-analysis","search-retrieval"],"confidence":0.5,"matches":0,"success_rate":0},{"id":"the-stack-v2__cap_4","uri":"capability://data.processing.analysis.content.based.deduplication.at.file.and.repository.levels","name":"content-based deduplication at file and repository levels","description":"Removes duplicate code files and repositories using content hashing (SHA-256 or similar) and fuzzy matching for near-duplicates. Operates in two stages: exact deduplication via hash matching, then fuzzy matching (e.g., Jaccard similarity or MinHash) to catch semantically identical code with minor formatting differences. Preserves one canonical copy of each unique code pattern while removing redundant training examples.","intents":["Reduce dataset size and training time by eliminating redundant code that would skew model learning toward common patterns","Improve model generalization by ensuring each unique code pattern appears only once in training data","Identify and remove copy-pasted code across repositories that inflates apparent diversity"],"best_for":["Teams training large models where redundant data increases training cost without improving performance","Researchers studying code diversity and wanting to measure true unique patterns vs. duplicates","Organizations optimizing dataset size for storage and bandwidth constraints"],"limitations":["Fuzzy matching threshold is arbitrary; too strict removes legitimate variations, too loose keeps near-duplicates","Deduplication is content-based, not semantic — functionally equivalent code with different variable names is treated as unique","Exact deduplication removes all but one copy; if the retained copy is low-quality, all duplicates are lost","Deduplication is one-way; can't recover which files were deduplicated without maintaining a mapping","Fuzzy matching is computationally expensive; O(n²) complexity for n files requires distributed processing"],"requires":["Cryptographic hash function (SHA-256) for exact matching","Fuzzy matching library (MinHash, Jaccard similarity, or Levenshtein distance) for near-duplicates","Distributed computing framework (Spark, Dask) for processing 67 TB dataset","Deduplication mapping database to track which files were removed"],"input_types":["raw source code files","file content hashes","similarity threshold (0.0-1.0)"],"output_types":["deduplicated file set with one canonical copy per unique pattern","deduplication mapping (original file → canonical file)","statistics on deduplication ratio and removed file count"],"categories":["data-processing-analysis"],"confidence":0.5,"matches":0,"success_rate":0},{"id":"the-stack-v2__cap_5","uri":"capability://data.processing.analysis.software.heritage.archive.integration.and.version.control.history.access","name":"software heritage archive integration and version control history access","description":"Integrates with Software Heritage's comprehensive archive of 200+ million repositories and their full version control history. Extracts source code snapshots from Software Heritage's Git/Mercurial/SVN repositories, preserving repository metadata (commit history, author info, timestamps). Provides access to code at specific points in time, enabling historical analysis or training on code evolution patterns.","intents":["Access the largest open-source code archive without needing to clone millions of repositories individually","Train models on code evolution by sampling from different points in repository history","Preserve historical code context and repository metadata for research on code development patterns"],"best_for":["Researchers studying code evolution and development practices across open-source ecosystems","Teams wanting to train models on code at specific historical points (e.g., pre-2020 code for legacy system understanding)","Organizations needing comprehensive open-source code coverage without maintaining their own archive"],"limitations":["Software Heritage archive is read-only; no ability to update or modify archived code","Repository metadata (author names, emails, commit messages) may contain PII that requires additional redaction","Not all repositories in Software Heritage are included in The Stack v2 (only permissively licensed ones)","Version control history is preserved but not actively used for training; dataset is snapshot-based, not history-aware","Access to Software Heritage API may have rate limits or availability constraints"],"requires":["Software Heritage API access (public, but may have rate limits)","Knowledge of Software Heritage identifiers (SWHIDs) for repository lookup","Processing time for extracting snapshots from version control systems"],"input_types":["Software Heritage repository identifiers (SWHIDs)","Git/Mercurial/SVN repository URLs","commit hashes or timestamps for historical snapshots"],"output_types":["source code files from specific repository snapshots","repository metadata (owner, license, language, commit count)","version control history (optional, for research use)"],"categories":["data-processing-analysis","memory-knowledge"],"confidence":0.5,"matches":0,"success_rate":0},{"id":"the-stack-v2__cap_6","uri":"capability://safety.moderation.license.compliance.and.legal.metadata.tracking","name":"license compliance and legal metadata tracking","description":"Tracks and validates SPDX license identifiers for each repository, ensuring only permissively licensed code (MIT, Apache 2.0, BSD, etc.) is included. Maintains license metadata alongside code files, enabling downstream users to verify legal compliance. Implements license hierarchy and compatibility checking to handle dual-licensed or complex licensing scenarios.","intents":["Ensure models trained on The Stack v2 can be legally used commercially without license violations","Provide license metadata for each code file so downstream users can verify compliance with their own legal requirements","Document licensing decisions for transparency and auditability"],"best_for":["Commercial AI companies training models for production use without licensing risk","Organizations with strict legal compliance requirements (financial services, healthcare)","Open-source projects wanting to ensure their training data is compatible with their own licenses"],"limitations":["License detection relies on SPDX identifiers and file headers; custom or non-standard licenses may be missed","Dual-licensed repositories require manual review; automated detection may pick wrong license branch","License compatibility checking is complex; some licenses have subtle incompatibilities not captured by simple rules","License metadata is only as good as repository declarations; many repos lack proper license files","No tracking of license changes over time; dataset is snapshot-based, not history-aware"],"requires":["SPDX license list and compatibility matrix","License detection library (e.g., licensename, FOSSOLOGY) for automated identification","Manual review process for ambiguous or custom licenses"],"input_types":["repository license declarations (LICENSE file, package.json, setup.py, etc.)","SPDX identifiers","license text for validation"],"output_types":["validated SPDX license identifier per repository","license compatibility assessment","license metadata file accompanying code files","audit log of license decisions"],"categories":["safety-moderation","data-processing-analysis"],"confidence":0.5,"matches":0,"success_rate":0},{"id":"the-stack-v2__cap_7","uri":"capability://automation.workflow.dataset.versioning.and.reproducibility.tracking","name":"dataset versioning and reproducibility tracking","description":"Maintains versioned snapshots of the dataset (e.g., v2.0, v2.1) with documented changes between versions (new repositories added, deduplication improvements, PII removal updates). Provides checksums and manifests for reproducibility, enabling researchers to cite specific dataset versions and reproduce results. Tracks dataset lineage and transformation history.","intents":["Enable reproducible research by allowing researchers to cite and access exact dataset versions used in model training","Track dataset evolution and understand how changes between versions affect model performance","Provide audit trail for dataset modifications and quality improvements"],"best_for":["Academic researchers publishing papers and needing reproducible datasets","Teams training multiple model versions and wanting to isolate dataset changes from model changes","Organizations with compliance requirements for data lineage and audit trails"],"limitations":["Versioning adds storage overhead; maintaining multiple snapshots of 67 TB dataset is expensive","Version diffs are large; can't easily identify what changed between versions without downloading both","Reproducibility is limited to dataset content; doesn't capture deduplication or PII removal parameters that may vary","Versioning is manual process; no automatic versioning on every change"],"requires":["Version control system or dataset registry (e.g., Hugging Face Hub) supporting versioning","Checksum/manifest generation (SHA-256 for files, JSON for metadata)","Documentation of changes between versions"],"input_types":["dataset snapshot at point in time","changelog documenting modifications","configuration parameters for deduplication and PII removal"],"output_types":["versioned dataset with semantic version number (e.g., v2.0)","manifest file listing all files and checksums","changelog documenting changes from previous version","reproducibility metadata (deduplication threshold, PII removal rules, etc.)"],"categories":["automation-workflow","data-processing-analysis"],"confidence":0.5,"matches":0,"success_rate":0},{"id":"the-stack-v2__cap_8","uri":"capability://data.processing.analysis.training.data.preparation.and.tokenization.for.llm.fine.tuning","name":"training data preparation and tokenization for llm fine-tuning","description":"Provides pre-processed code files formatted for direct use in LLM training pipelines, with optional tokenization using standard tokenizers (GPT-2, GPT-3, Llama, etc.). Includes language-specific formatting (e.g., preserving indentation for Python, handling multi-line strings) and optional code-specific preprocessing (e.g., removing comments, normalizing whitespace). Supports both raw code and tokenized sequences depending on downstream model architecture.","intents":["Use dataset directly for fine-tuning code LLMs without additional preprocessing","Train models with language-aware formatting that preserves code semantics","Experiment with different tokenizers and preprocessing strategies without re-downloading raw data"],"best_for":["ML teams training code models and wanting to skip data preprocessing steps","Researchers experimenting with different tokenization strategies","Organizations with limited data engineering resources"],"limitations":["Pre-tokenized data is locked to specific tokenizer; can't easily switch tokenizers without re-processing","Language-specific formatting may not match all model architectures; some models expect normalized whitespace","Preprocessing choices (e.g., comment removal) are opinionated; may not match downstream use cases","Tokenized sequences lose original code structure; can't recover exact formatting from tokens","No support for custom preprocessing pipelines; users must implement their own if needed"],"requires":["Tokenizer library (transformers, tiktoken, etc.) for tokenization","Understanding of target model's input format (sequence length, special tokens, etc.)","Storage for tokenized sequences (may be larger or smaller than raw code depending on tokenizer)"],"input_types":["raw source code files","tokenizer specification (GPT-2, GPT-3, Llama, etc.)","preprocessing options (comment removal, whitespace normalization, etc.)"],"output_types":["tokenized sequences ready for LLM training","token count statistics per language","metadata mapping tokens back to original files (optional)"],"categories":["data-processing-analysis","code-generation-editing"],"confidence":0.5,"matches":0,"success_rate":0},{"id":"the-stack-v2__cap_9","uri":"capability://code.generation.editing.training.data.for.starcoder2.and.code.generation.models","name":"training data for starcoder2 and code generation models","description":"Serves as the primary training dataset for StarCoder2 models and other code generation models. Provides high-quality, permissively-licensed, deduplicated code across 600+ languages with repository context. Enables training of state-of-the-art code LLMs that understand diverse programming paradigms, languages, and coding patterns. Documented as essential resource for reproducing StarCoder2 and training similar models.","intents":["Train state-of-the-art code generation models comparable to StarCoder2","Reproduce StarCoder2 training or fine-tune models on this dataset","Build code understanding models for diverse programming languages and paradigms","Enable research on code model training and evaluation"],"best_for":["model developers training production-grade code LLMs","researchers reproducing or extending StarCoder2 work","organizations building code generation and understanding systems"],"limitations":["Dataset is optimized for code generation; may not be ideal for other code tasks (e.g., code search, vulnerability detection)","Training on 67 TB requires significant computational resources (GPUs, TPUs, distributed training)","Model quality depends on training procedures, hyperparameters, and infrastructure beyond dataset quality","Dataset may contain biases toward popular languages and coding patterns"],"requires":["Significant computational resources for training (GPUs/TPUs, distributed infrastructure)","ML training framework (PyTorch, TensorFlow, JAX) and expertise","Understanding of code model training procedures and best practices","Hugging Face Datasets library for efficient data loading"],"input_types":["raw code dataset from Hugging Face Datasets","training configuration and hyperparameters","model architecture specifications"],"output_types":["trained code generation models","model checkpoints and weights","training logs and evaluation metrics"],"categories":["code-generation-editing","data-processing-analysis"],"confidence":0.5,"matches":0,"success_rate":0},{"id":"the-stack-v2__headline","uri":"capability://model.training.largest.open.source.dataset.for.training.code.generation.models","name":"largest open-source dataset for training code generation models","description":"The Stack v2 is the largest open dataset of permissively licensed source code, covering over 600 programming languages, making it an essential resource for training advanced code generation models like StarCoder2.","intents":["best dataset for code generation","open-source code dataset for training","largest dataset for AI coding models","source code dataset for machine learning","permissively licensed code dataset"],"best_for":["AI model training","research in code generation"],"limitations":[],"requires":[],"input_types":[],"output_types":[],"categories":["model-training"],"confidence":0.5,"matches":0,"success_rate":0}],"trust":{"score":58,"verified":false,"data_access_risk":"high","permissions":["Hugging Face account for dataset access","Minimum 100 GB free disk space for partial dataset, 500+ GB for full dataset","Network bandwidth for multi-TB download (recommend enterprise connection)","Python 3.8+ with datasets library for programmatic access","Repository ownership verification (GitHub account, email domain ownership, or similar)","Submission to BigCode project's opt-out registry (URL/form TBD)","Processing time of 1-4 weeks for removal to take effect in next dataset release","Regex engine supporting PCRE or similar for pattern matching","Entropy calculation library for secret detection (Shannon entropy threshold ~3.5 bits/char)","Processing time proportional to dataset size; 67 TB requires distributed scanning infrastructure"],"failure_modes":["Permissive license filtering excludes GPL and AGPL code, limiting coverage of certain ecosystems (Linux kernel, GNU tools)","Deduplication is content-based, not semantic — similar algorithms in different styles may be retained as duplicates","License detection relies on heuristics and file headers; edge cases with dual-licensing or custom licenses may be misclassified","67 TB dataset requires significant storage infrastructure and bandwidth for download/processing","Opt-out is reactive, not proactive — requires developers to actively request removal","No guarantee of removal from already-trained models using earlier dataset versions","Processing opt-out requests introduces latency; exclusions may not apply until next dataset version","Relies on repository ownership verification; spoofed removal requests possible without strong authentication","Regex and entropy-based detection have false positives (e.g., UUIDs flagged as secrets) and false negatives (obfuscated credentials)","Context-aware redaction may reduce code utility — removing variable names or function signatures can break code semantics","builder identity is not verified yet","no observed match outcomes yet"],"rank_breakdown":{"adoption":0.7,"quality":0.9,"ecosystem":0.3,"match_graph":0.25,"freshness":0.75,"weights":{"adoption":0.3,"quality":0.25,"ecosystem":0.1,"match_graph":0.3,"freshness":0.05}},"observed_outcomes":{"matches":0,"success_rate":0,"avg_confidence":0,"top_intents":[],"last_matched_at":null},"maintenance":{"status":"active","updated_at":"2026-05-24T12:16:28.696Z","last_scraped_at":null,"last_commit":null},"community":{"stars":null,"forks":null,"weekly_downloads":null,"model_downloads":null,"model_likes":null}},"distribution":{"claim_url":"https://unfragile.ai/submit?claim=the-stack-v2","compare_url":"https://unfragile.ai/compare?artifact=the-stack-v2"}},"signature":"IsbL2X3UGeT3vnydTBGkNyzpW4z8j9D0+lJI9iNF8IHGFcIz1aq8Pg/8wja+UwPqW8BGIZs9xKq9U4kgrE53Cg==","signedAt":"2026-06-21T06:57:34.104Z","signedBy":"unfragile.ai","version":1},"_links":{"self":"https://unfragile.ai/api/v1/passport/the-stack-v2","artifact":"https://unfragile.ai/the-stack-v2","verify":"https://unfragile.ai/api/v1/verify?slug=the-stack-v2","publicKey":"https://unfragile.ai/api/v1/trust-passport-public-key","spec":"https://unfragile.ai/trust","schema":"https://unfragile.ai/schema.json","docs":"https://unfragile.ai/docs"}}