The Stack v2
Dataset · Free
67 TB permissively licensed code dataset across 600+ languages.
Capabilities (10 decomposed)
permissively-licensed source code dataset curation at scale
Medium confidence: Aggregates 67 TB of source code from the Software Heritage archive with automated license classification and filtering to retain only permissively licensed content (Apache 2.0, MIT, BSD, ISC, etc.; copyleft licenses such as the GPL are excluded). Uses metadata-driven filtering pipelines to exclude proprietary and restrictive licenses, enabling legal compliance for model training without manual license auditing. Implements a Software Heritage integration layer to access the largest open-source repository snapshot available.
Largest permissively licensed code dataset (67 TB across 600+ languages), sourced from the Software Heritage archive with an automated license-filtering pipeline, enabling license-compliant training of open models at unprecedented scale without manual auditing
Larger and more legally vetted than GitHub-only datasets (CodeSearchNet, GitHub-Code) and includes non-GitHub repositories, while maintaining strict permissive licensing unlike raw GitHub dumps that require post-hoc filtering
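The metadata-driven filter described above can be sketched as an allow-list pass over per-file license annotations. This is a minimal sketch: the `detected_licenses` field name and the license set below are illustrative assumptions, not The Stack v2's actual schema or license list.

```python
# Illustrative allow-list license filtering over per-file metadata
# records. Field names and the allow-list are assumptions, not the
# BigCode pipeline's actual schema.

PERMISSIVE_LICENSES = {
    "apache-2.0", "mit", "bsd-2-clause", "bsd-3-clause", "isc", "unlicense",
}

def is_permissive(record: dict) -> bool:
    """Keep a file only if it has at least one detected license and
    every detected license is on the permissive allow-list."""
    licenses = record.get("detected_licenses", [])
    return bool(licenses) and all(l.lower() in PERMISSIVE_LICENSES for l in licenses)

def filter_permissive(records: list) -> list:
    """Drop files with missing, restrictive, or mixed restrictive licensing."""
    return [r for r in records if is_permissive(r)]
```

Requiring *every* detected license to be permissive (rather than any) is the conservative choice when a file carries multiple license tags.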
multi-language source code normalization and deduplication
Medium confidence: Implements a rigorous deduplication pipeline that identifies and removes duplicate code across 600+ programming languages using content-based hashing and semantic similarity detection. Normalizes code formatting, whitespace, and comments to identify near-duplicates that would otherwise inflate dataset size and introduce training bias. Uses language-specific tokenization and AST-aware comparison for structural duplicates, not just string matching.
Language-aware deduplication across 600+ languages using content hashing and AST-based structural comparison, not just string matching, to identify near-duplicates and boilerplate code that would bias model training
More sophisticated than simple hash-based deduplication used in CodeSearchNet; handles language-specific formatting variations and generated code patterns that generic string matching would miss
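A minimal form of the content-based near-duplicate detection described above is to normalize away formatting and comments before hashing. This sketch handles only a hypothetical `#`-style line comment; the real pipeline is described as language-aware and also uses AST-level comparison.

```python
import hashlib
import re

def normalize(source: str) -> str:
    """Crude normalization: strip '#'-style line comments (illustrative;
    real pipelines use language-aware tokenizers) and collapse whitespace."""
    lines = []
    for line in source.splitlines():
        line = re.sub(r"#.*$", "", line)  # drop trailing comment
        line = " ".join(line.split())     # collapse internal whitespace
        if line:
            lines.append(line)
    return "\n".join(lines)

def content_key(source: str) -> str:
    """Hash the normalized form so formatting variants collide."""
    return hashlib.sha256(normalize(source).encode("utf-8")).hexdigest()

def deduplicate(files: list) -> list:
    """Keep the first occurrence of each normalized-content key."""
    seen, unique = set(), []
    for f in files:
        key = content_key(f)
        if key not in seen:
            seen.add(key)
            unique.append(f)
    return unique
```

Two files that differ only in indentation, trailing whitespace, or comments map to the same key and are deduplicated.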
personally identifiable information (pii) detection and removal
Medium confidence: Applies automated PII detection pipelines to identify and redact sensitive information (email addresses, API keys, credentials, personal names, phone numbers, etc.) from source code before dataset release. Uses pattern matching, regex-based detection, and potentially ML-based classifiers to find PII in comments, strings, and code. Implements configurable redaction strategies (masking, removal, replacement with placeholders) while preserving code functionality.
Automated PII detection and redaction pipeline applied across 67 TB of code to remove credentials, emails, names, and sensitive data before public release, with configurable redaction strategies that preserve code functionality
More comprehensive than manual review or simple regex patterns; applies consistent PII removal at scale across diverse code repositories, reducing privacy risks in publicly released training data
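The pattern-matching layer of such a pipeline can be sketched with a few regexes. The patterns below are simplified assumptions (BigCode's pipeline additionally uses ML-based classifiers), and production patterns need far more care around false positives:

```python
import re

# Illustrative regex-based redactors; real pipelines combine ML
# classifiers with rules. These patterns are deliberately simplified.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "AWS_KEY": re.compile(r"AKIA[0-9A-Z]{16}"),     # AWS access key id shape
    "IPV4": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
}

def redact(text: str) -> str:
    """Replace each match with a typed placeholder, preserving structure
    so the surrounding code stays parseable."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"<{label}>", text)
    return text
```

Replacing with a typed placeholder rather than deleting the match keeps string literals syntactically valid, which is the "preserving code functionality" goal mentioned above.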
opt-out governance model for repository exclusion
Medium confidence: Implements a governance framework allowing repository owners to request exclusion of their code from the dataset via an opt-out mechanism (e.g., registry, email contact, automated API). Processes exclusion requests, removes matching repositories from the dataset, and maintains an exclusion list for future dataset versions. Respects developer autonomy and copyright concerns while maintaining dataset openness by default.
Opt-out governance model allowing repository owners to request exclusion from the dataset, respecting developer autonomy and copyright concerns while maintaining an open-by-default approach to dataset curation
More developer-friendly than opt-in models (which would require explicit consent from millions of developers) while more respectful than no-opt-out approaches; balances openness with individual control
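Applying an exclusion list during dataset assembly can be sketched as URL canonicalization plus a set lookup. Matching opt-out requests by repository URL, and the `repo_url` field name, are assumptions about the mechanism:

```python
# Sketch of applying an opt-out exclusion list during dataset assembly.
# Keying on repository URL is an assumption about how requests are matched.

def normalize_repo(url: str) -> str:
    """Canonicalize repository URLs (trailing slash, .git suffix, case)
    so opt-out matching is robust to trivial variations."""
    return url.rstrip("/").removesuffix(".git").lower()

def apply_opt_outs(records: list, opt_out_urls: list) -> list:
    """Drop every file record belonging to an opted-out repository."""
    excluded = {normalize_repo(u) for u in opt_out_urls}
    return [r for r in records if normalize_repo(r["repo_url"]) not in excluded]
```

The same exclusion set can be persisted and re-applied when building future dataset versions, as the governance description above requires.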
600+ programming language support with language-specific metadata
Medium confidence: Covers source code across 600+ programming languages with language-specific metadata (syntax, paradigm, ecosystem, file extensions, etc.). Implements language detection and classification pipelines to identify code language, extract language-specific features, and organize data by language family. Enables language-stratified sampling and analysis, supporting diverse model training use cases from general-purpose to language-specific code models.
Comprehensive coverage of 600+ programming languages with language-specific metadata and classification, enabling stratified sampling and language-aware model training at unprecedented scale and diversity
Broader language coverage than GitHub-only datasets (typically 10-20 languages) and more structured language metadata than raw code dumps; supports both general-purpose and language-specific model training
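Extension-based detection is the simplest form of the language classification described above. The mapping below is a toy subset for illustration; production pipelines use dedicated linguist-style classifiers (e.g., go-enry) that also inspect file content:

```python
from pathlib import Path

# Toy extension -> language map; illustrative subset only.
EXTENSION_LANGUAGES = {
    ".py": "Python", ".rs": "Rust", ".go": "Go",
    ".js": "JavaScript", ".ts": "TypeScript", ".ml": "OCaml",
}

def detect_language(path: str) -> str:
    """Classify a file by extension, case-insensitively."""
    return EXTENSION_LANGUAGES.get(Path(path).suffix.lower(), "Unknown")

def stratify_by_language(paths: list) -> dict:
    """Bucket files by detected language, enabling per-language
    sampling ratios during training-set construction."""
    buckets = {}
    for p in paths:
        buckets.setdefault(detect_language(p), []).append(p)
    return buckets
```

Stratified buckets like these are what make language-aware sampling possible when assembling a training mix.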
repository-level metadata enrichment and context preservation
Medium confidence: Preserves and enriches repository-level metadata including creation date, last update, star count, fork count, contributor count, license type, and language distribution. Maintains file-to-repository mappings and directory structure information, enabling context-aware model training that understands code within its repository ecosystem. Implements metadata aggregation from Software Heritage and GitHub APIs to provide rich contextual information for each code sample.
Preserves rich repository-level metadata (stars, forks, creation date, contributor count, license) alongside code content, enabling context-aware model training that understands code within its ecosystem and quality signals
More comprehensive than raw code dumps; provides repository context that enables quality-aware training and downstream applications like code search, while maintaining file-to-repository mappings for structured analysis
software heritage archive integration and snapshot access
Medium confidence: Integrates with the Software Heritage archive, a comprehensive snapshot of open-source software repositories worldwide, to access code at scale without relying on individual repository APIs or GitHub. Implements Software Heritage API clients and data export pipelines to retrieve code content, metadata, and version history. Enables reproducible dataset snapshots by referencing specific Software Heritage revisions, supporting dataset versioning and reproducibility.
Leverages Software Heritage archive as the data source, providing comprehensive open-source code snapshot with reproducible versioning via SWHIDs, independent of GitHub or any single platform
More comprehensive and platform-independent than GitHub-only datasets; enables reproducible snapshots and includes non-GitHub repositories, while avoiding API rate limits and platform dependency
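Content-level SWHIDs are what make these snapshots reproducible: they reuse Git's blob hashing, so anyone can recompute an identifier from the raw bytes and verify a dataset reference. A minimal sketch:

```python
import hashlib

def content_swhid(data: bytes) -> str:
    """Compute a Software Heritage content identifier (swh:1:cnt:...).
    Content SWHIDs reuse Git's blob hashing: SHA-1 over a
    'blob <length>\\0' header followed by the raw bytes."""
    header = b"blob %d\x00" % len(data)
    digest = hashlib.sha1(header + data).hexdigest()
    return f"swh:1:cnt:{digest}"
```

The empty file, for example, resolves to the same hash as Git's well-known empty blob, so SWHID references can be cross-checked against Git tooling.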
dataset versioning and release management
Medium confidence: Implements versioning and release management for dataset versions (v1, v2, etc.) with documented changes, improvements, and data quality enhancements between versions. Maintains version-specific documentation, changelog, and reproducibility information. Enables users to select specific dataset versions for training, ensuring reproducibility and allowing comparison of model performance across dataset versions.
Implements explicit dataset versioning (v1, v2) with documented improvements and reproducibility information, enabling users to specify exact dataset versions for training and supporting reproducible research
More structured than continuously updated datasets; enables reproducibility and comparison across versions, while providing clear documentation of improvements and changes between releases
hugging face datasets integration and streaming access
Medium confidence: Integrates with the Hugging Face Datasets library, providing standardized dataset loading, streaming, and sampling interfaces. Implements dataset cards with documentation, license information, and usage guidelines. Enables efficient streaming access to the 67 TB dataset without downloading it in full, supporting memory-constrained training environments. Provides dataset splits, sampling strategies, and preprocessing utilities for common training workflows.
Native Hugging Face Datasets integration with streaming access to 67 TB dataset, enabling efficient training without full download while providing standardized dataset cards and preprocessing utilities
More convenient than raw data downloads for Hugging Face users; streaming access reduces storage requirements, while standardized dataset cards provide clear documentation and usage guidelines
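A streaming loader can be sketched as below. The dataset id `bigcode/the-stack-v2` is the public Hugging Face listing (access is gated behind accepting the terms), and note that v2 rows reference file content by SWHID rather than embedding it, so the per-row fields here are assumptions:

```python
# Sketch of streaming access via the Hugging Face `datasets` library.
# Requires `pip install datasets` and accepting the dataset's terms;
# the import is kept inside the function so the module loads without it.

def stream_stack_v2(split: str = "train", limit: int = 5):
    """Yield up to `limit` rows without downloading the full dataset."""
    from datasets import load_dataset
    ds = load_dataset("bigcode/the-stack-v2", split=split, streaming=True)
    for i, row in enumerate(ds):
        if i >= limit:
            break
        yield row
```

With `streaming=True`, rows are fetched lazily over the network, so a 67 TB dataset can be sampled from a machine with ordinary disk and memory.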
training data for starcoder2 and code generation models
Medium confidence: Serves as the primary training dataset for StarCoder2 models and other code generation models. Provides high-quality, permissively licensed, deduplicated code across 600+ languages with repository context. Enables training of state-of-the-art code LLMs that understand diverse programming paradigms, languages, and coding patterns. Documented as an essential resource for reproducing StarCoder2 and training similar models.
Curated and published as the official training dataset for StarCoder2 models, providing permissively-licensed, deduplicated, PII-removed code across 600+ languages with repository context and governance
More comprehensive and higher-quality than previous code datasets (CodeSearchNet, GitHub-Code) with rigorous deduplication, PII removal, and licensing compliance; enables training of state-of-the-art code models
Capabilities are decomposed by AI analysis. Each maps to specific user intents and improves with match feedback.
Related Artifacts (sharing capabilities)
Artifacts that share capabilities with The Stack v2, ranked by overlap. Discovered automatically through the match graph.
StarCoder Data
783 GB curated code dataset from 86 languages with PII redaction.
Private AI
Multi-modal PII detection and redaction API for 49 languages.
StarCoderData
250GB curated code dataset for StarCoder training.
rehydra
A zero-trust SDK for anonymizing PII locally before sending prompts to LLMs and seamlessly rehydrating the response.
Granite
IBM's enterprise-focused open foundation models.
Lakera Guard
Real-time prompt injection and LLM threat detection API.
Best For
- ✓ open-source model developers training code LLMs
- ✓ research teams building code understanding benchmarks
- ✓ organizations committed to open-source licensing compliance
- ✓ model trainers optimizing dataset quality and training efficiency
- ✓ researchers studying code diversity and representation in training data
- ✓ teams building code datasets with limited storage budgets
- ✓ responsible AI practitioners building datasets for public release
- ✓ organizations subject to privacy regulations
Known Limitations
- ⚠ License classification relies on repository metadata and file headers; some licenses may be misclassified or missing
- ⚠ Excludes valuable proprietary and restrictively licensed code (GPL-only, SSPL, etc.), which may limit model diversity
- ⚠ The Software Heritage snapshot is point-in-time; it doesn't continuously track new repositories or license changes
- ⚠ Deduplication is lossy; it removes legitimate variations of common patterns (e.g., multiple implementations of quicksort)
- ⚠ Language-specific deduplication requires parsers for each language; some esoteric languages may fall back to generic string matching
- ⚠ Semantic similarity detection may incorrectly flag functionally equivalent but structurally different code as duplicates
UnfragileRank
UnfragileRank is computed from adoption signals, documentation quality, ecosystem connectivity, match graph feedback, and freshness. No artifact can pay for a higher rank.
About
BigCode project's 67 TB dataset of permissively licensed source code from Software Heritage archive covering 600+ programming languages. The largest open code dataset available, used to train StarCoder2 models. Includes full file content, repository metadata, and license information. Follows an opt-out governance model allowing repository owners to exclude their code. Rigorous deduplication and PII removal pipeline. Essential resource for training code generation models.
Alternatives to The Stack v2
Hugging Face Hub
The GitHub for AI — 500K+ models, datasets, Spaces, Inference API, hub for open-source AI.