The Stack v2
Dataset · Free
67 TB permissively licensed code dataset across 600+ languages.
Capabilities (10 decomposed)
permissively-licensed source code dataset curation at scale
Medium confidence: Aggregates 67 TB of source code from the Software Heritage archive with automated license classification and filtering to retain only permissively licensed content (Apache 2.0, MIT, BSD, ISC, etc.; copyleft licenses such as the GPL are excluded). Uses metadata-driven filtering pipelines to exclude proprietary and restrictive licenses, enabling legal compliance for model training without manual license auditing. Implements a Software Heritage integration layer to access the largest open-source repository snapshot available.
Largest permissively licensed code dataset (67 TB across 600+ languages), sourced from the Software Heritage archive with an automated license-filtering pipeline, enabling license-compliant training of open models at unprecedented scale without manual auditing
Larger and more legally vetted than GitHub-only datasets (CodeSearchNet, GitHub-Code) and includes non-GitHub repositories, while maintaining strict permissive licensing unlike raw GitHub dumps that require post-hoc filtering
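The metadata-driven filter described above can be sketched as an allow-list pass over per-file license annotations. This is a minimal sketch: the `detected_licenses` field name and the license set below are illustrative assumptions, not The Stack v2's actual schema or license list.

```python
# Illustrative allow-list license filtering over per-file metadata
# records. Field names and the allow-list are assumptions, not the
# BigCode pipeline's actual schema.

PERMISSIVE_LICENSES = {
    "apache-2.0", "mit", "bsd-2-clause", "bsd-3-clause", "isc", "unlicense",
}

def is_permissive(record: dict) -> bool:
    """Keep a file only if it has at least one detected license and
    every detected license is on the permissive allow-list."""
    licenses = record.get("detected_licenses", [])
    return bool(licenses) and all(l.lower() in PERMISSIVE_LICENSES for l in licenses)

def filter_permissive(records: list) -> list:
    """Drop files with missing, restrictive, or mixed restrictive licensing."""
    return [r for r in records if is_permissive(r)]
```

Requiring *every* detected license to be permissive (rather than any) is the conservative choice when a file carries multiple license tags.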
multi-language source code normalization and deduplication
Medium confidence: Implements a rigorous deduplication pipeline that identifies and removes duplicate code across 600+ programming languages using content-based hashing and semantic similarity detection. Normalizes code formatting, whitespace, and comments to identify near-duplicates that would otherwise inflate dataset size and introduce training bias. Uses language-specific tokenization and AST-aware comparison for structural duplicates, not just string matching.
Language-aware deduplication across 600+ languages using content hashing and AST-based structural comparison, not just string matching, to identify near-duplicates and boilerplate code that would bias model training
More sophisticated than simple hash-based deduplication used in CodeSearchNet; handles language-specific formatting variations and generated code patterns that generic string matching would miss
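A minimal form of the content-based near-duplicate detection described above is to normalize away formatting and comments before hashing. This sketch handles only a hypothetical `#`-style line comment; the real pipeline is described as language-aware and also uses AST-level comparison.

```python
import hashlib
import re

def normalize(source: str) -> str:
    """Crude normalization: strip '#'-style line comments (illustrative;
    real pipelines use language-aware tokenizers) and collapse whitespace."""
    lines = []
    for line in source.splitlines():
        line = re.sub(r"#.*$", "", line)  # drop trailing comment
        line = " ".join(line.split())     # collapse internal whitespace
        if line:
            lines.append(line)
    return "\n".join(lines)

def content_key(source: str) -> str:
    """Hash the normalized form so formatting variants collide."""
    return hashlib.sha256(normalize(source).encode("utf-8")).hexdigest()

def deduplicate(files: list) -> list:
    """Keep the first occurrence of each normalized-content key."""
    seen, unique = set(), []
    for f in files:
        key = content_key(f)
        if key not in seen:
            seen.add(key)
            unique.append(f)
    return unique
```

Two files that differ only in indentation, trailing whitespace, or comments map to the same key and are deduplicated.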
personally identifiable information (pii) detection and removal
Medium confidence: Applies automated PII detection pipelines to identify and redact sensitive information (email addresses, API keys, credentials, personal names, phone numbers, etc.) from source code before dataset release. Uses pattern matching, regex-based detection, and potentially ML-based classifiers to find PII in comments, strings, and code. Implements configurable redaction strategies (masking, removal, replacement with placeholders) while preserving code functionality.
Automated PII detection and redaction pipeline applied across 67 TB of code to remove credentials, emails, names, and sensitive data before public release, with configurable redaction strategies that preserve code functionality
More comprehensive than manual review or simple regex patterns; applies consistent PII removal at scale across diverse code repositories, reducing privacy risks in publicly released training data
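The pattern-matching layer of such a pipeline can be sketched with a few regexes. The patterns below are simplified assumptions (BigCode's pipeline additionally uses ML-based classifiers), and production patterns need far more care around false positives:

```python
import re

# Illustrative regex-based redactors; real pipelines combine ML
# classifiers with rules. These patterns are deliberately simplified.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "AWS_KEY": re.compile(r"AKIA[0-9A-Z]{16}"),     # AWS access key id shape
    "IPV4": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
}

def redact(text: str) -> str:
    """Replace each match with a typed placeholder, preserving structure
    so the surrounding code stays parseable."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"<{label}>", text)
    return text
```

Replacing with a typed placeholder rather than deleting the match keeps string literals syntactically valid, which is the "preserving code functionality" goal mentioned above.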
opt-out governance model for repository exclusion
Medium confidence: Implements a governance framework allowing repository owners to request exclusion of their code from the dataset via an opt-out mechanism (e.g., registry, email contact, automated API). Processes exclusion requests, removes matching repositories from the dataset, and maintains an exclusion list for future dataset versions. Respects developer autonomy and copyright concerns while maintaining dataset openness by default.
Opt-out governance model allowing repository owners to request exclusion from the dataset, respecting developer autonomy and copyright concerns while maintaining an open-by-default approach to dataset curation
More developer-friendly than opt-in models (which would require explicit consent from millions of developers) while more respectful than no-opt-out approaches; balances openness with individual control
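Applying an exclusion list during dataset assembly can be sketched as URL canonicalization plus a set lookup. Matching opt-out requests by repository URL, and the `repo_url` field name, are assumptions about the mechanism:

```python
# Sketch of applying an opt-out exclusion list during dataset assembly.
# Keying on repository URL is an assumption about how requests are matched.

def normalize_repo(url: str) -> str:
    """Canonicalize repository URLs (trailing slash, .git suffix, case)
    so opt-out matching is robust to trivial variations."""
    return url.rstrip("/").removesuffix(".git").lower()

def apply_opt_outs(records: list, opt_out_urls: list) -> list:
    """Drop every file record belonging to an opted-out repository."""
    excluded = {normalize_repo(u) for u in opt_out_urls}
    return [r for r in records if normalize_repo(r["repo_url"]) not in excluded]
```

The same exclusion set can be persisted and re-applied when building future dataset versions, as the governance description above requires.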
600+ programming language support with language-specific metadata
Medium confidence: Covers source code across 600+ programming languages with language-specific metadata (syntax, paradigm, ecosystem, file extensions, etc.). Implements language detection and classification pipelines to identify code language, extract language-specific features, and organize data by language family. Enables language-stratified sampling and analysis, supporting diverse model training use cases from general-purpose to language-specific code models.
Comprehensive coverage of 600+ programming languages with language-specific metadata and classification, enabling stratified sampling and language-aware model training at unprecedented scale and diversity
Broader language coverage than GitHub-only datasets (typically 10-20 languages) and more structured language metadata than raw code dumps; supports both general-purpose and language-specific model training
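Extension-based detection is the simplest form of the language classification described above. The mapping below is a toy subset for illustration; production pipelines use dedicated linguist-style classifiers (e.g., go-enry) that also inspect file content:

```python
from pathlib import Path

# Toy extension -> language map; illustrative subset only.
EXTENSION_LANGUAGES = {
    ".py": "Python", ".rs": "Rust", ".go": "Go",
    ".js": "JavaScript", ".ts": "TypeScript", ".ml": "OCaml",
}

def detect_language(path: str) -> str:
    """Classify a file by extension, case-insensitively."""
    return EXTENSION_LANGUAGES.get(Path(path).suffix.lower(), "Unknown")

def stratify_by_language(paths: list) -> dict:
    """Bucket files by detected language, enabling per-language
    sampling ratios during training-set construction."""
    buckets = {}
    for p in paths:
        buckets.setdefault(detect_language(p), []).append(p)
    return buckets
```

Stratified buckets like these are what make language-aware sampling possible when assembling a training mix.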
repository-level metadata enrichment and context preservation
Medium confidence: Preserves and enriches repository-level metadata including creation date, last update, star count, fork count, contributor count, license type, and language distribution. Maintains file-to-repository mappings and directory structure information, enabling context-aware model training that understands code within its repository ecosystem. Implements metadata aggregation from Software Heritage and GitHub APIs to provide rich contextual information for each code sample.
Preserves rich repository-level metadata (stars, forks, creation date, contributor count, license) alongside code content, enabling context-aware model training that understands code within its ecosystem and quality signals
More comprehensive than raw code dumps; provides repository context that enables quality-aware training and downstream applications like code search, while maintaining file-to-repository mappings for structured analysis
software heritage archive integration and snapshot access
Medium confidence: Integrates with the Software Heritage archive, a comprehensive snapshot of open-source software repositories worldwide, to access code at scale without relying on individual repository APIs or GitHub. Implements Software Heritage API clients and data export pipelines to retrieve code content, metadata, and version history. Enables reproducible dataset snapshots by referencing specific Software Heritage revisions, supporting dataset versioning and reproducibility.
Leverages Software Heritage archive as the data source, providing comprehensive open-source code snapshot with reproducible versioning via SWHIDs, independent of GitHub or any single platform
More comprehensive and platform-independent than GitHub-only datasets; enables reproducible snapshots and includes non-GitHub repositories, while avoiding API rate limits and platform dependency
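Content-level SWHIDs are what make these snapshots reproducible: they reuse Git's blob hashing, so anyone can recompute an identifier from the raw bytes and verify a dataset reference. A minimal sketch:

```python
import hashlib

def content_swhid(data: bytes) -> str:
    """Compute a Software Heritage content identifier (swh:1:cnt:...).
    Content SWHIDs reuse Git's blob hashing: SHA-1 over a
    'blob <length>\\0' header followed by the raw bytes."""
    header = b"blob %d\x00" % len(data)
    digest = hashlib.sha1(header + data).hexdigest()
    return f"swh:1:cnt:{digest}"
```

The empty file, for example, resolves to the same hash as Git's well-known empty blob, so SWHID references can be cross-checked against Git tooling.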
dataset versioning and release management
Medium confidence: Implements versioning and release management for dataset versions (v1, v2, etc.) with documented changes, improvements, and data quality enhancements between versions. Maintains version-specific documentation, changelog, and reproducibility information. Enables users to select specific dataset versions for training, ensuring reproducibility and allowing comparison of model performance across dataset versions.
Implements explicit dataset versioning (v1, v2) with documented improvements and reproducibility information, enabling users to specify exact dataset versions for training and supporting reproducible research
More structured than continuously updated datasets; enables reproducibility and comparison across versions, while providing clear documentation of improvements and changes between releases
hugging face datasets integration and streaming access
Medium confidence: Integrates with the Hugging Face Datasets library, providing standardized dataset loading, streaming, and sampling interfaces. Implements dataset cards with documentation, license information, and usage guidelines. Enables efficient streaming access to the 67 TB dataset without downloading it in full, supporting memory-constrained training environments. Provides dataset splits, sampling strategies, and preprocessing utilities for common training workflows.
Native Hugging Face Datasets integration with streaming access to 67 TB dataset, enabling efficient training without full download while providing standardized dataset cards and preprocessing utilities
More convenient than raw data downloads for Hugging Face users; streaming access reduces storage requirements, while standardized dataset cards provide clear documentation and usage guidelines
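A streaming loader can be sketched as below. The dataset id `bigcode/the-stack-v2` is the public Hugging Face listing (access is gated behind accepting the terms), and note that v2 rows reference file content by SWHID rather than embedding it, so the per-row fields here are assumptions:

```python
# Sketch of streaming access via the Hugging Face `datasets` library.
# Requires `pip install datasets` and accepting the dataset's terms;
# the import is kept inside the function so the module loads without it.

def stream_stack_v2(split: str = "train", limit: int = 5):
    """Yield up to `limit` rows without downloading the full dataset."""
    from datasets import load_dataset
    ds = load_dataset("bigcode/the-stack-v2", split=split, streaming=True)
    for i, row in enumerate(ds):
        if i >= limit:
            break
        yield row
```

With `streaming=True`, rows are fetched lazily over the network, so a 67 TB dataset can be sampled from a machine with ordinary disk and memory.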
training data for starcoder2 and code generation models
Medium confidence: Serves as the primary training dataset for StarCoder2 models and other code generation models. Provides high-quality, permissively licensed, deduplicated code across 600+ languages with repository context. Enables training of state-of-the-art code LLMs that understand diverse programming paradigms, languages, and coding patterns. Documented as an essential resource for reproducing StarCoder2 and training similar models.
Curated and published as the official training dataset for StarCoder2 models, providing permissively-licensed, deduplicated, PII-removed code across 600+ languages with repository context and governance
More comprehensive and higher-quality than previous code datasets (CodeSearchNet, GitHub-Code) with rigorous deduplication, PII removal, and licensing compliance; enables training of state-of-the-art code models
Capabilities are decomposed by AI analysis. Each maps to specific user intents and improves with match feedback.
Related Artifacts (sharing capabilities)
Artifacts that share capabilities with The Stack v2, ranked by overlap. Discovered automatically through the match graph.
StarCoder Data
783 GB curated code dataset from 86 languages with PII redaction.
Private AI
Multi-modal PII detection and redaction API for 49 languages.
StarCoderData
250GB curated code dataset for StarCoder training.
rehydra
A zero-trust SDK for anonymizing PII locally before sending prompts to LLMs and seamlessly rehydrating the response.
Granite
IBM's enterprise-focused open foundation models.
Lakera Guard
Real-time prompt injection and LLM threat detection API.
Best For
- ✓ open-source model developers training code LLMs
- ✓ research teams building code understanding benchmarks
- ✓ organizations committed to open-source licensing compliance
- ✓ model trainers optimizing dataset quality and training efficiency
- ✓ researchers studying code diversity and representation in training data
- ✓ teams building code datasets with limited storage budgets
- ✓ responsible AI practitioners building datasets for public release
- ✓ organizations subject to privacy regulations
Known Limitations
- ⚠ License classification relies on repository metadata and file headers; some licenses may be misclassified or missing
- ⚠ Excludes valuable proprietary and restrictively licensed code (GPL-only, SSPL, etc.), which may limit model diversity
- ⚠ The Software Heritage snapshot is point-in-time; it doesn't continuously track new repositories or license changes
- ⚠ Deduplication is lossy; it removes legitimate variations of common patterns (e.g., multiple implementations of quicksort)
- ⚠ Language-specific deduplication requires parsers for each language; some esoteric languages may fall back to generic string matching
- ⚠ Semantic similarity detection may incorrectly flag functionally equivalent but structurally different code as duplicates
UnfragileRank
UnfragileRank is computed from adoption signals, documentation quality, ecosystem connectivity, match graph feedback, and freshness. No artifact can pay for a higher rank.
About
BigCode project's 67 TB dataset of permissively licensed source code from Software Heritage archive covering 600+ programming languages. The largest open code dataset available, used to train StarCoder2 models. Includes full file content, repository metadata, and license information. Follows an opt-out governance model allowing repository owners to exclude their code. Rigorous deduplication and PII removal pipeline. Essential resource for training code generation models.
Alternatives to The Stack v2
Hugging Face Hub
The GitHub for AI — 500K+ models, datasets, Spaces, Inference API, hub for open-source AI.