Capability
3 artifacts provide this capability.
via “largest open-source dataset for training code generation models”
67 TB permissively licensed code dataset across 600+ languages.
Unique: At 67 TB across 600+ languages, the dataset stands out from other code datasets for both its size and its breadth of programming-language coverage.
vs others: Unlike smaller datasets, The Stack v2 offers a larger and more diverse collection of code, which matters when training code models that must generalize across languages.
Open code model trained on 600+ languages.
Unique: Trained on The Stack v2's 600+ language coverage (vs. roughly 50 languages for Codex and roughly 30 for Copilot), enabling code generation for languages such as Solidity, Kotlin, and Rust without separate models.
vs others: Broader language support than the competitors listed; a single model handles polyglot codebases instead of requiring a separate model per language, which helps most for emerging languages where dedicated models don't exist.
via “multi-language code dataset curation with near-deduplication”
Curated code dataset for StarCoder training (about 783 GB of code, roughly 250 billion tokens).
Unique: Applies probabilistic near-deduplication at scale across 86 languages with language-aware filtering, rather than simple string matching or language-agnostic hashing. Integrates GitHub issues and commits as additional code context, not just raw source files.
vs others: Larger and more diverse than CodeSearchNet (6 languages, about 6M functions) and more aggressively deduplicated than the raw Stack, balancing scale against training efficiency. The training datasets behind Codex and GPT-4 are not public, so they cannot be compared directly.
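The listing above only says "probabilistic near-deduplication" and does not name an algorithm. As a rough illustration of what that can look like, here is a minimal sketch that assumes MinHash signatures over token shingles, a common technique for this task. The function names, the shingle size, the signature length, and the 0.7 threshold are all hypothetical choices made for the example, not values taken from the dataset's pipeline.

```python
import hashlib
import re

NUM_HASHES = 64    # signature length (illustrative assumption)
SHINGLE_SIZE = 5   # tokens per shingle (illustrative assumption)


def shingles(text, k=SHINGLE_SIZE):
    """Return the set of k-token shingles of a source file."""
    tokens = re.findall(r"\w+", text)
    if len(tokens) < k:
        return {" ".join(tokens)}
    return {" ".join(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}


def minhash_signature(text, num_hashes=NUM_HASHES):
    """For each seeded hash function, keep the minimum hash over all shingles."""
    file_shingles = shingles(text)
    signature = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(8, "little")  # a different salt acts as a different hash function
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "little",
            )
            for s in file_shingles
        ))
    return signature


def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching positions, which estimates Jaccard similarity of the shingle sets."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


def near_deduplicate(files, threshold=0.7):
    """Keep the first file of each near-duplicate group.

    Compares every file against every kept file, which is quadratic and only
    suitable for illustration.
    """
    kept = []  # list of (name, signature) pairs
    for name, text in files:
        sig = minhash_signature(text)
        if all(estimated_jaccard(sig, kept_sig) < threshold for _, kept_sig in kept):
            kept.append((name, sig))
    return [name for name, _ in kept]


if __name__ == "__main__":
    corpus = [
        ("a.py", "def add(a, b):\n    return a + b\n\nprint(add(1, 2))"),
        ("b.py", "def add(a, b):\n    return a + b\n\nprint(add(1, 2))"),
        ("c.py", "import os\nfor entry in os.listdir('.'):\n    print(entry.upper())"),
    ]
    print(near_deduplicate(corpus))
```

The pairwise comparison here is the simplest thing that works on a handful of files. At dataset scale, signatures are typically bucketed with locality-sensitive hashing so that only likely duplicates are compared; that step is omitted to keep the sketch short.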