Largest Open Source Dataset For Training Code Generation Models

1

The Pile | Dataset | 59/100

via “public reproducibility and open-source model training”

EleutherAI's 825 GiB diverse training dataset from 22 sources.

Unique: Provides a large-scale, publicly available, freely downloadable pretraining dataset designed for open-source LLM development, enabling full reproducibility and transparency. This contrasts with proprietary datasets (used by OpenAI, Google, Meta) that are not publicly available, and with academic datasets that lack the scale and diversity needed for large models. The Pile's influence on subsequent open datasets (such as RedPajama) establishes it as a foundational artifact for open-source AI.

vs others: More accessible than proprietary datasets (OpenAI, Google) because it is freely available; more comprehensive than earlier open datasets (WikiText, BookCorpus) because it includes 825 GiB across 22 sources; more influential than contemporary datasets because it established design patterns for open-source LLM training data.

2

Common Crawl | Dataset | 59/100

via “open web data archive for model training”

Largest open web crawl archive and a foundation of much LLM training data.

Unique: Common Crawl's extensive and regularly updated dataset distinguishes it as a foundational resource for AI and data science.

vs others: Unlike static, one-off datasets, Common Crawl offers a vast and continuously refreshed archive of web data, which makes it well suited to large-scale model training.

3

The Stack v2 | Dataset | 58/100

via “largest open-source dataset for training code generation models”

67 TB permissively licensed code dataset across 600+ languages.

Unique: At 67 TB across 600+ programming languages, its size and language coverage set it apart from other code datasets.

vs others: Unlike smaller datasets, The Stack v2 offers a vast and diverse collection of code, essential for training robust AI models.

4

Dolma | Dataset | 58/100

via “large-scale language model training dataset”

Allen AI's 3T token dataset for fully reproducible LLM training.

Unique: Curated from diverse sources to give a comprehensive and balanced dataset for language model training.

vs others: Unlike other datasets, Dolma offers a massive scale and detailed curation processes that enhance model training outcomes.

5

CodeLlama 70B | Model | 57/100

via “open-source code generation model”

Meta's 70B specialized code generation model.

Unique: It is the largest dedicated open-source model specifically optimized for code generation and understanding.

vs others: CodeLlama 70B stands out for its extensive training on code data and its ability to handle a large context window, surpassing many alternatives in both scale and performance.

6

StarCoderData | Dataset | 57/100

via “curated code dataset for training ai models”

Curated code dataset of roughly 250B tokens for StarCoder training.

Unique: This dataset is uniquely filtered for quality and privacy, making it ideal for training robust AI models across multiple programming languages.

vs others: Stronger than alternatives due to its extensive curation and focus on quality, ensuring better training outcomes for AI models.

7

Falcon 180B | Model | 57/100

via “large-scale autoregressive text generation with 180b parameters”

TII's 180B model trained on curated RefinedWeb data.

Unique: Largest open-source single-expert (non-MoE) model at release with 180B parameters trained on meticulously cleaned RefinedWeb data (3.5T tokens), achieving competitive reasoning and knowledge performance without mixture-of-experts complexity, enabling deterministic inference patterns and simplified deployment compared to sparse models.

vs others: Larger parameter count than most open-source alternatives (LLaMA 70B, Mixtral 8x7B) with claimed GPT-4-competitive reasoning, but requires 2-3x more compute than quantized smaller models and lacks the documented instruction tuning and safety alignment of production-ready closed models.

8

CodeContests | Dataset | 57/100

via “competitive-programming-problem-corpus-with-multi-language-solutions”

13K competitive programming problems from AlphaCode research.

Unique: Curated from real competitive programming platforms (Codeforces, AtCoder) with difficulty calibration via median/95th percentile metrics, rather than synthetic or classroom problems. Includes both public and hidden test cases enabling true generalization evaluation, and was specifically constructed to train AlphaCode, making it the largest real-world algorithmic problem corpus for code generation.

vs others: Larger and more algorithmically rigorous than HumanEval or MBPP (which focus on simple utility functions), and more representative of real problem-solving than synthetic benchmarks, while providing standardized difficulty stratification absent from raw Codeforces dumps.
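
Example: the public/hidden test split mentioned above can be sketched as follows. This is a minimal illustration under stated assumptions, not the AlphaCode evaluation harness: the names `pass_rate` and `evaluate`, the toy "sum two integers" problem, and the (input, expected output) tuple format are all hypothetical. It shows why hidden tests matter: a solution that memorizes the public example passes the public tests but does poorly on the hidden ones.

```python
from typing import Callable, Dict, List, Tuple

# A test case is (input text, expected output text). Hypothetical format.
TestCase = Tuple[str, str]


def pass_rate(solve: Callable[[str], str], tests: List[TestCase]) -> float:
    """Fraction of test cases whose output matches after trimming whitespace."""
    if not tests:
        return 0.0
    passed = sum(1 for given, expected in tests if solve(given).strip() == expected.strip())
    return passed / len(tests)


def evaluate(
    solve: Callable[[str], str],
    public_tests: List[TestCase],
    hidden_tests: List[TestCase],
) -> Dict[str, float]:
    """Score a candidate solution separately on public and hidden tests."""
    return {
        "public": pass_rate(solve, public_tests),
        "hidden": pass_rate(solve, hidden_tests),
    }


# Toy problem: print the sum of two integers.
public_tests = [("1 2", "3")]
hidden_tests = [("5 7", "12"), ("0 0", "0")]


def overfit_solution(text: str) -> str:
    # Memorizes the public example instead of solving the problem.
    return "3" if text == "1 2" else "0"


def correct_solution(text: str) -> str:
    return str(sum(int(part) for part in text.split()))


overfit_scores = evaluate(overfit_solution, public_tests, hidden_tests)
correct_scores = evaluate(correct_solution, public_tests, hidden_tests)
print(overfit_scores)  # {'public': 1.0, 'hidden': 0.5}
print(correct_scores)  # {'public': 1.0, 'hidden': 1.0}
```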

9

Mixtral 8x7B | Model | 57/100

via “code-generation-and-completion”

Mistral's mixture-of-experts model with efficient routing.

Unique: Explicitly documented as having 'strong performance' on code generation tasks with HumanEval benchmark results, achieved through training on code-inclusive datasets and instruction-tuning via SFT + DPO. Sparse routing architecture enables code generation at 6x faster inference speed than dense 70B models.

vs others: Provides open-source code generation with GPT-3.5-level performance and 6x faster inference than Llama 2 70B, enabling self-hosted code completion without reliance on proprietary APIs or external services.
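
Example: the sparse routing idea behind the speed claim can be sketched with a toy top-k gate. This is an illustrative sketch, not Mixtral's implementation: it assumes 8 experts with 2 selected per input, uses scalar inputs instead of token vectors, and the names `route_top_k` and `make_expert` are hypothetical. The point is that only the selected experts run, so compute per input scales with k rather than with the total number of experts.

```python
import math
from typing import Callable, List, Sequence, Tuple

Expert = Callable[[float], float]


def softmax(values: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a short list of scores."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def route_top_k(
    gate_logits: Sequence[float],
    experts: Sequence[Expert],
    x: float,
    k: int = 2,
) -> Tuple[float, List[int]]:
    """Run only the k experts with the highest gate scores and mix their outputs."""
    ranked = sorted(range(len(experts)), key=lambda i: gate_logits[i], reverse=True)
    chosen = ranked[:k]
    # Renormalize the gate scores over the chosen experts only.
    weights = softmax([gate_logits[i] for i in chosen])
    output = sum(w * experts[i](x) for w, i in zip(weights, chosen))
    return output, chosen


def make_expert(scale: float) -> Expert:
    """Toy expert: multiplies its input by a fixed scale (assumption for the demo)."""
    def expert(x: float) -> float:
        return scale * x
    return expert


experts = [make_expert(float(i + 1)) for i in range(8)]
gate_logits = [0.1, 2.0, -1.0, 0.5, 1.0, 0.0, -0.5, 0.3]

output, chosen = route_top_k(gate_logits, experts, x=2.0)
print(chosen)  # [1, 4]
```

In a real model the gate logits come from a learned linear layer applied to each token, and the experts are feed-forward blocks; the selection-and-mix step is the part shown here.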

10

LLaVA 1.6 | Model | 57/100

via “synthetic-instruction-data-generation-and-curation”

Open multimodal model for visual reasoning.

Unique: First large-scale application of language-only GPT-4 to generate multimodal instruction-following data (158K samples) without human annotation; dataset is publicly released and reproducible, enabling community-driven research on synthetic data quality and effectiveness

vs others: Eliminates annotation costs compared to human-labeled datasets like Visual Genome or Conceptual Captions, while achieving competitive model performance (85.1% relative to GPT-4); enables rapid iteration on model architectures without waiting for manual data labeling

11

Yi-34B | Model | 57/100

via “competitive coding task performance with transformer architecture”

01.AI's bilingual 34B model with 200K context option.

Unique: Achieves competitive coding performance through general-purpose transformer pretraining on 3 trillion tokens without documented code-specific fine-tuning or instruction tuning, suggesting strong code representation learning from raw pretraining data. Bilingual training enables code generation with Chinese comments and documentation.

vs others: Provides competitive coding capability at 34B scale without the specialized training overhead of CodeLlama or Codex, reducing model size and inference cost while maintaining reasonable code quality for non-critical applications.

12

StarCoder Data | Dataset | 56/100

via “curated code training dataset for ai models”

783 GB curated code dataset from 86 languages with PII redaction.

Unique: This dataset includes meticulous data processing and an opt-out mechanism for developers, setting it apart from other code datasets.

vs others: Unlike other datasets, StarCoder Data offers a vast and diverse collection of code with a focus on ethical use and developer consent.
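
Example: PII redaction of the kind mentioned above can be illustrated with a simple pattern-based pass over source files. This is only a sketch and is not the pipeline used to build this dataset: the regular expression, the `redact_emails` helper, and the `<EMAIL>` placeholder are assumptions for illustration, and it covers email addresses only.

```python
import re

# Simplified email pattern; real-world PII detection needs broader coverage
# (names, keys, IP addresses) and usually more than regular expressions.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def redact_emails(source: str, placeholder: str = "<EMAIL>") -> str:
    """Replace every email-like substring in a source file with a placeholder."""
    return EMAIL_RE.sub(placeholder, source)


sample = "# Author: jane.doe@example.com\nx = 1\n"
redacted = redact_emails(sample)
print(redacted)
```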

13

APPS (Automated Programming Progress Standard) | Dataset | 56/100

via “large-scale evaluation dataset for model benchmarking”

10K coding problems across 3 difficulty levels with test suites.

Unique: Publicly available on Hugging Face with standardized dataset loading interface, enabling reproducible benchmarking across research groups without custom infrastructure, rather than proprietary or difficult-to-access benchmarks

vs others: About 60x larger than HumanEval (10K vs 164 problems) with a more realistic difficulty distribution and comprehensive test suites, enabling more reliable statistical conclusions about model capabilities

14

LLaVA-Instruct 150K | Dataset | 56/100

via “large-scale visual instruction tuning corpus”

150K visual instruction examples for multimodal model training.

Unique: Achieves 150K-example scale through systematic generation with language-only GPT-4 rather than manual annotation, making large-scale instruction tuning datasets feasible. The scale gives models enough data diversity to learn generalizable visual understanding patterns.

vs others: Larger than most manually annotated visual instruction datasets (COCO has 330K images but fewer instruction examples); more cost-effective than human annotation at scale; enables training of models competitive with those trained on larger proprietary datasets.

15

Granite | Repository | 55/100

via “enterprise-grade code generation models”

IBM's enterprise-focused open foundation models.

Unique: Granite models are specifically trained on enterprise data and support a wide range of programming languages, making them suitable for diverse coding tasks.

vs others: Granite Code Models offer competitive performance and multilingual capabilities compared to other code generation models, particularly for enterprise use.

16

dolphin-2.9.1-yi-1.5-34b | Model | 49/100

via “code generation and understanding across multiple programming languages”

Text-generation model. 4,703,591 downloads.

Unique: Trained on CodeFeedback-Filtered-Instruction (human-curated code quality feedback) and dolphin-coder datasets, enabling the model to generate not just syntactically valid code but code that follows best practices and idioms, rather than generic token-matching approaches used in simpler code completion models

vs others: Generates more idiomatic and maintainable code than base language models due to CodeFeedback training, while remaining fully open-source and deployable locally unlike Copilot; smaller than Codex-scale models but with better instruction-following for code generation tasks

17

Open-Sora-v2 | Model | 37/100

via “open-source model architecture and training code accessibility”

Text-to-video model. 16,568 downloads.

Unique: Provides complete training pipeline with distributed training support (DDP, DeepSpeed), configuration management, and evaluation metrics, enabling researchers to reproduce results and fine-tune on custom datasets. Unlike proprietary models (Runway, Pika), full architecture and training code are publicly available for inspection and modification.

vs others: More transparent and customizable than closed-source competitors because full training code is available, and more accessible than academic papers alone because code includes practical implementation details, hyperparameter settings, and dataset preprocessing scripts.

18

xCodeEval | Dataset | 24/100

via “code-to-text generation dataset for documentation and explanation”

Dataset by NTU-NLP-sg. 665,024 downloads.

Unique: Combines code snippets with expert-generated natural language descriptions across multiple languages, enabling training of code-to-text models through abstractive and detailed generation formulations. It integrates code understanding with natural language generation at scale.

vs others: More multilingual and larger than CodeSearchNet's code-to-documentation pairs and includes expert-validated descriptions, whereas most prior datasets rely on mined documentation or single-language focus

19

OpenAI: gpt-oss-120b (free) | Model | 24/100

via “code generation and technical problem-solving”

gpt-oss-120b is an open-weight, 117B-parameter Mixture-of-Experts (MoE) language model from OpenAI designed for high-reasoning, agentic, and general-purpose production use cases. It activates 5.1B parameters per forward pass and is optimized...

Unique: Trained on diverse code repositories with MoE routing that specializes expert networks for different programming paradigms (functional, OOP, procedural); enables language-agnostic code understanding and cross-language pattern transfer

vs others: More cost-effective than GitHub Copilot for batch code generation; comparable code quality to GPT-4 for most languages while maintaining lower latency through sparse activation

20

Qwen 2.5 Coder (1.5B, 3B, 7B, 32B) | Model | 24/100

via “code-generation-from-natural-language-prompts”

Alibaba's Qwen 2.5 variant specialized for code generation and understanding.

Unique: Alibaba's code-specialized training approach combined with Ollama's local-first distribution model enables code generation without sending code to external cloud services. The uniform 32K context window across all model sizes (0.5B-32B) provides consistent context handling, though smaller models may struggle with complex generation tasks.

vs others: Faster than GitHub Copilot for local development workflows because inference runs entirely on-device without cloud round-trips, and more privacy-preserving than OpenAI Codex because generated code never leaves the developer's machine.
