AI Datasets

The data that powers AI — training datasets, evaluation benchmarks, fine-tuning data, and domain-specific corpora hosted on Hugging Face and beyond.

100 datasets

9 categories

model-training (98)research-search (45)testing-quality (24)data-pipelines (19)code-review-security (6)image-generation (5)rag-knowledge (4)data-analysis (3)automation (1)

100 of 100

xCodeEvalDataset80/100Open Source

Multilingual code evaluation across 17 languages.

·Ranked by freshness 1, adoption 1

WinoGrandeDataset80/100Open Source

44K pronoun resolution problems testing commonsense understanding.

·Ranked by freshness 1, adoption 1

WildGuardDataset80/100Open Source

Allen AI's safety classification dataset and model.

·Ranked by freshness 1, adoption 1

WildChatDataset80/100Open Source

1M+ real user-AI conversations with demographic metadata.

·Ranked by freshness 1, adoption 1

Visual GenomeDataset80/100Open Source

108K images with dense scene graphs and 5.4M region descriptions.

·Ranked by freshness 1, adoption 1

UltraFeedbackDataset80/100Open Source

64K preference dataset for RLHF training.

·Ranked by freshness 1, adoption 1

UltraChat 200KDataset80/100Open Source

200K high-quality multi-turn dialogues for instruction tuning.

·Ranked by freshness 1, adoption 1

TruthfulQADataset80/100Open Source

817 adversarial questions measuring model truthfulness vs misconceptions.

·Ranked by freshness 1, adoption 1

TriviaQADataset80/100Open Source

95K trivia questions requiring cross-document reasoning.

·Ranked by freshness 1, adoption 1

ToxiGenDataset80/100Open Source

Microsoft's dataset for implicit toxicity detection.

·Ranked by freshness 1, adoption 1

The Stack v2Dataset80/100Open Source

67 TB permissively licensed code dataset across 600+ languages.

·Ranked by freshness 1, adoption 1

The PileDataset80/100Open Source

EleutherAI's 825 GiB diverse training dataset from 22 sources.

·Ranked by freshness 1, adoption 1

TextVQADataset80/100Open Source

45K questions requiring reading text in images.

·Ranked by freshness 1, adoption 1

StarCoderDataDataset80/100Open Source

250GB curated code dataset for StarCoder training.

·Ranked by freshness 1, adoption 1

StarCoder DataDataset80/100Open Source

783 GB curated code dataset from 86 languages with PII redaction.

·Ranked by freshness 1, adoption 1

Stanford AlpacaDataset80/100Open Source

Stanford's 52K GPT-3.5-generated instruction dataset that started it all.

·Ranked by freshness 1, adoption 1

SQuAD 2.0Dataset80/100Open Source

150K reading comprehension questions including unanswerable ones.

·Ranked by freshness 1, adoption 1

ShareGPT4VDataset80/100Open Source

1.2M image-text pairs with GPT-4V captions.

·Ranked by freshness 1, adoption 1

ShareGPTDataset80/100Open Source

Real ChatGPT conversations used to train Vicuna.

·Ranked by freshness 1, adoption 1

SafetyBenchDataset80/100Open Source

11K safety evaluation questions across 7 categories.

·Ranked by freshness 1, adoption 1

ROOTSDataset80/100Open Source

BigScience's curated multilingual dataset for BLOOM.

·Ranked by freshness 1, adoption 1

RedPajama v2Dataset80/100Open Source

30 trillion token web dataset with 40+ quality signals per document.

·Ranked by freshness 1, adoption 1

RealToxicityPromptsDataset80/100Open Source

100K prompts for evaluating toxic text generation.

·Ranked by freshness 1, adoption 1

PubMedQADataset80/100Open Source

Biomedical QA from PubMed abstracts testing evidence-based reasoning.

·Ranked by freshness 1, adoption 1

OPUSDataset80/100Open Source

Massive parallel corpus for machine translation.

·Ranked by freshness 1, adoption 1

OpenAssistant Conversations (OASST)Dataset80/100Open Source

161K human-written messages in 35 languages with quality ratings.

·Ranked by freshness 1, adoption 1

NectarDataset80/100Open Source

183K multi-turn preference comparisons for alignment.

·Ranked by freshness 1, adoption 1

Natural QuestionsDataset80/100Open Source

307K real Google Search queries answered from Wikipedia.

·Ranked by freshness 1, adoption 1

MS COCO (Common Objects in Context)Dataset80/100Open Source

330K images with object detection, segmentation, and captions.

·Ranked by freshness 1, adoption 1

MMLU (Massive Multitask Language Understanding)Dataset80/100Open Source

57-subject benchmark, the standard metric for comparing LLMs.

·Ranked by freshness 1, adoption 1

MedQA (USMLE)Dataset80/100Open Source

12.7K USMLE medical exam questions for clinical AI evaluation.

·Ranked by freshness 1, adoption 1

mC4Dataset80/100Open Source

Multilingual web corpus covering 101 languages.

·Ranked by freshness 1, adoption 1

MBPP (Mostly Basic Python Problems)Dataset80/100Open Source

974 basic Python problems complementing HumanEval for code evaluation.

·Ranked by freshness 1, adoption 1

MBPP+Dataset80/100Open Source

Enhanced Python coding benchmark with rigorous testing.

·Ranked by freshness 1, adoption 1

MATHDataset80/100Open Source

12.5K competition math problems across 7 subjects and 5 difficulty levels.

·Ranked by freshness 1, adoption 1

MagpieDataset80/100Open Source

300K instructions extracted directly from aligned LLM outputs.

·Ranked by freshness 1, adoption 1

LLaVA-Instruct 150KDataset80/100Open Source

150K visual instruction examples for multimodal model training.

·Ranked by freshness 1, adoption 1

LAION-5BDataset80/100Open Source

5.85 billion image-text pairs foundational for image generation.

·Ranked by freshness 1, adoption 1

ImageNet (ILSVRC)Dataset80/100Open Source

14M images in 21K categories, the benchmark that launched deep learning.

·Ranked by freshness 1, adoption 1

HotpotQADataset80/100Open Source

113K questions requiring multi-hop reasoning across Wikipedia articles.

·Ranked by freshness 1, adoption 1

HellaSwagDataset80/100Open Source

70K commonsense reasoning questions with adversarial distractors.

·Ranked by freshness 1, adoption 1

FLAN CollectionDataset80/100Open Source

Google's 1,836-task instruction mixture for broad generalization.

·Ranked by freshness 1, adoption 1

FinQADataset80/100Open Source

8.3K financial reasoning questions over real S&P 500 earnings reports.

·Ranked by freshness 1, adoption 1

FineWebDataset80/100Open Source

Hugging Face's 15T token dataset, new standard for LLM training.

·Ranked by freshness 1, adoption 1

DS-1000Dataset80/100Open Source

1,000 data science problems across 7 Python libraries.

·Ranked by freshness 1, adoption 1

DolmaDataset80/100Open Source

Allen AI's 3T token dataset for fully reproducible LLM training.

·Ranked by freshness 1, adoption 1

CulturaXDataset80/100Open Source

6.3T token multilingual dataset across 167 languages.

·Ranked by freshness 1, adoption 1

Common CrawlDataset80/100Open Source

Largest open web crawl archive, foundation of all LLM training data.

·Ranked by freshness 1, adoption 1

CodeSearchNetDataset80/100Open Source

6M functions across 6 languages paired with documentation.

·Ranked by freshness 1, adoption 1

CodeContestsDataset80/100Open Source

13K competitive programming problems from AlphaCode research.

·Ranked by freshness 1, adoption 1

CapybaraDataset80/100Open Source

Multi-turn conversation dataset for steerable models.

·Ranked by freshness 1, adoption 1

C4 (Colossal Clean Crawled Corpus)Dataset80/100Open Source

Google's cleaned Common Crawl corpus used to train T5.

·Ranked by freshness 1, adoption 1

BIG-Bench Hard (BBH)Dataset80/100Open Source

23 hardest BIG-Bench tasks where models initially failed.

·Ranked by freshness 1, adoption 1

ARC (AI2 Reasoning Challenge)Dataset80/100Open Source

7.8K science questions testing genuine reasoning, not just recall.

·Ranked by freshness 1, adoption 1

APPS (Automated Programming Progress Standard)Dataset80/100Open Source

10K coding problems across 3 difficulty levels with test suites.

·Ranked by freshness 1, adoption 1

ScaleDataset78/100

An AI platform providing quality training data for applications like autonomous vehicles and...

14 capabilities·Ranked by freshness 1, quality 1

LaionDataset78/100Free

Unlock AI potential: vast datasets, cutting-edge models, free access,...

9 capabilities·Ranked by freshness 1, quality 0

Dataset MarketplaceDataset78/100Free

Access, customize high-quality datasets easily; ideal for AI, research, market...

10 capabilities·Ranked by freshness 1, quality 0

documentation-imagesDataset66/100Open Source

Dataset by huggingface. 24,44,926 downloads.

6 capabilities·Ranked by freshness 1, ecosystem 1

banned-historical-archivesDataset65/100Open Source

Dataset by banned-historical-archives. 17,46,771 downloads.

6 capabilities·Ranked by freshness 1, ecosystem 1

wikitextDataset64/100Open Source

Dataset by Salesforce. 12,11,500 downloads.

5 capabilities·Ranked by freshness 1, ecosystem 1

xperience-10mDataset64/100Open Source

Dataset by ropedia-ai. 14,56,180 downloads.

6 capabilities·Ranked by freshness 1, ecosystem 1

CADS-datasetDataset64/100Open Source

Dataset by mrmrx. 12,02,174 downloads.

6 capabilities·Ranked by freshness 1, ecosystem 1

ubuntu_osworld_file_cacheDataset63/100Open Source

Dataset by xlangai. 10,37,848 downloads.

5 capabilities·Ranked by freshness 1, ecosystem 0

PhysicalAI-Autonomous-VehiclesDataset63/100Open Source

Dataset by nvidia. 10,17,553 downloads.

5 capabilities·Ranked by freshness 1, ecosystem 0

resultsDataset63/100Open Source

Dataset by mteb. 10,39,913 downloads.

5 capabilities·Ranked by freshness 1, ecosystem 0

MINT-1T-PDF-CC-2024-18Dataset63/100Open Source

Dataset by mlfoundations. 10,34,415 downloads.

6 capabilities·Ranked by freshness 1, ecosystem 1

hd_tmpDataset63/100Open Source

Dataset by ayuo. 10,53,941 downloads.

6 capabilities·Ranked by freshness 1, ecosystem 0

SWE-bench_VerifiedDataset62/100Open Source

Dataset by princeton-nlp. 6,78,148 downloads.

5 capabilities·Ranked by freshness 1, ecosystem 1

gsm8kDataset62/100Open Source

Dataset by openai. 8,22,680 downloads.

5 capabilities·Ranked by freshness 1, ecosystem 1

xCodeEvalDataset62/100Open Source

Dataset by NTU-NLP-sg. 6,96,087 downloads.

7 capabilities·Ranked by freshness 1, ecosystem 1

MINT-1T-PDF-CC-2023-50Dataset62/100Open Source

Dataset by mlfoundations. 7,96,577 downloads.

6 capabilities·Ranked by freshness 1, ecosystem 1

MINT-1T-PDF-CC-2023-40Dataset62/100Open Source

Dataset by mlfoundations. 8,57,357 downloads.

6 capabilities·Ranked by freshness 1, ecosystem 1

c4Dataset62/100Open Source

Dataset by allenai. 6,98,456 downloads.

7 capabilities·Ranked by freshness 1, ecosystem 1

OpenThoughts-1k-sampleDataset61/100Open Source

Dataset by ryanmarten. 5,33,474 downloads.

5 capabilities·Ranked by freshness 1, ecosystem 1

MINT-1T-PDF-CC-2023-23Dataset61/100Open Source

Dataset by mlfoundations. 6,33,111 downloads.

7 capabilities·Ranked by freshness 1, ecosystem 1

MINT-1T-PDF-CC-2023-14Dataset61/100Open Source

Dataset by mlfoundations. 5,72,108 downloads.

6 capabilities·Ranked by freshness 1, ecosystem 1

MINT-1T-PDF-CC-2023-06Dataset61/100Open Source

Dataset by mlfoundations. 5,39,406 downloads.

6 capabilities·Ranked by freshness 1, ecosystem 1

FineFineWebDataset61/100Open Source

Dataset by m-a-p. 5,55,725 downloads.

5 capabilities·Ranked by freshness 1, ecosystem 1

TxT360Dataset61/100Open Source

Dataset by LLM360. 4,90,092 downloads.

5 capabilities·Ranked by freshness 1, ecosystem 1

medical-qa-shared-task-v1-toyDataset61/100Open Source

Dataset by lavita. 5,25,534 downloads.

7 capabilities·Ranked by freshness 1, ecosystem 1

pesozDataset61/100Open Source

Dataset by Kthera. 5,82,735 downloads.

5 capabilities·Ranked by freshness 1, ecosystem 0

finewebDataset61/100Open Source

Dataset by HuggingFaceFW. 6,37,939 downloads.

7 capabilities·Ranked by freshness 1, ecosystem 1

fineinstructions_nemotronDataset61/100Open Source

Dataset by fineinstructions. 5,46,949 downloads.

5 capabilities·Ranked by freshness 1, ecosystem 1

pspDataset61/100Open Source

Dataset by Emmyc2. 5,49,575 downloads.

5 capabilities·Ranked by freshness 1, ecosystem 0

objaverseDataset61/100Open Source

Dataset by allenai. 5,31,090 downloads.

7 capabilities·Ranked by freshness 1, ecosystem 1

Meta_Kaggle_Dataset_Archive_2026-03-12Dataset60/100Open Source

Dataset by Yarina. 4,13,291 downloads.

7 capabilities·Ranked by freshness 1, ecosystem 0

regionsDataset60/100Open Source

Dataset by world-igr-plum. 3,92,732 downloads.

6 capabilities·Ranked by freshness 1, ecosystem 0

debugDataset60/100Open Source

Dataset by rtrm. 4,15,242 downloads.

5 capabilities·Ranked by freshness 1, ecosystem 1

glueDataset60/100Open Source

Dataset by nyu-mll. 3,94,564 downloads.

8 capabilities·Ranked by freshness 1, ecosystem 1

upload2Dataset60/100Open Source

Dataset by Maynor996. 3,80,160 downloads.

6 capabilities·Ranked by freshness 1, ecosystem 1

finephraseDataset60/100Open Source

Dataset by HuggingFaceFW. 3,82,017 downloads.

6 capabilities·Ranked by freshness 1, ecosystem 1

fineweb-edu-translatedDataset60/100Open Source

Dataset by Helsinki-NLP. 3,84,377 downloads.

6 capabilities·Ranked by freshness 1, ecosystem 1

mmluDataset60/100Open Source

Dataset by cais. 4,39,045 downloads.

6 capabilities·Ranked by freshness 1, ecosystem 1

ai2_arcDataset60/100Open Source

Dataset by allenai. 4,06,798 downloads.

6 capabilities·Ranked by freshness 1, ecosystem 1

gaiaDataset59/100Open Source

Dataset by siril-spcc. 2,99,750 downloads.

5 capabilities·Ranked by freshness 1, ecosystem 0

hellaswagDataset59/100Open Source

Dataset by Rowan. 3,02,975 downloads.

8 capabilities·Ranked by freshness 1, ecosystem 1

mdm_depthDataset59/100Open Source

Dataset by robbyant. 2,74,791 downloads.

7 capabilities·Ranked by freshness 1, ecosystem 1

PhysicalAI-Robotics-GR00T-X-Embodiment-SimDataset59/100Open Source

Dataset by nvidia. 3,34,635 downloads.

9 capabilities·Ranked by freshness 1, ecosystem 1

100

vlm_test_imagesDataset59/100Open Source

Dataset by merve. 3,18,615 downloads.

7 capabilities·Ranked by freshness 1, ecosystem 1

Top Capabilities

Browse all →

code explanation and documentation generation11 artifacts

Analyzes selected code or entire files and generates natural language explanations of what the code does, how it works, and why certain patterns were chosen. The feature can produce documentation in multiple formats (docstrings, comments, markdown) and supports various documentation styles (JSDoc, Sphinx, etc.). Developers can request explanations at different levels of detail (high-level overview, line-by-line breakdown, architectural context) through the chat interface, with responses appearing as formatted text or code comments.

ChatGPT AIAI Pundit Magic - Design to Code | Figma to CodeCodeGPT: write and improve code using AI

direct speech-to-english translation without intermediate transcription3 artifacts

Translates non-English speech directly to English text using the same Transformer encoder-decoder architecture by prepending a 'translate' task token during decoding, bypassing explicit transcription. The AudioEncoder processes mel spectrograms identically to transcription, but the TextDecoder generates English tokens directly from audio embeddings. This end-to-end approach avoids cascading errors from intermediate transcription-then-translation pipelines and enables language-agnostic audio understanding.

WhisperWhisper Large v3Whisper CLI

automatic language identification with confidence scoring2 artifacts

Detects the spoken language in audio by analyzing the AudioEncoder embeddings and using the TextDecoder to predict a language token before generating transcription text. Language detection is implicit in the multitask training; the model learns to identify language from acoustic features without a separate classification head. Supports 99 languages with varying confidence based on training data representation (English: 65% of training data, others: 0.1-2%).

WhisperWhisper CLI

multi-turn conversational code assistance2 artifacts

Maintains conversation history within a single chat session, allowing developers to ask follow-up questions, request refinements, and build on previous responses without re-providing context. The extension manages conversation state (messages, responses, context) and sends the full conversation history to ChatGPT's API with each request, enabling contextual understanding of refinement requests like 'make it faster' or 'add error handling'.

ChatGPT AIChatGPT VSCode Plugin

context-aware code generation from natural language2 artifacts

Generates new code snippets based on natural language descriptions by sending the user's intent and current editor selection context to OpenAI's API, then inserting the generated code at the cursor position or displaying it in the sidebar. The extension reads the active editor's selected text to provide code context, enabling the model to generate syntactically appropriate code for the detected language. Generation is triggered via keyboard shortcut (Ctrl+Alt+G), command palette, or toolbar button.

ChatGPT AIRubberduck - ChatGPT for Visual Studio Code

automatic docstring and documentation generation2 artifacts

Generates docstrings, comments, and API documentation for functions, classes, and modules by analyzing code structure and semantics using GPT-4o. The extension detects function signatures, parameter types, and return types, then generates documentation in multiple formats (JSDoc, Python docstrings, Javadoc, etc.) matching the language and project conventions. Generated docs are inserted inline with proper indentation and formatting.

ChatGPT GPT-4o Cursor AI and Copilot, AI Copilot, AI Agent, Code Assistants, and Debugger,Code Chat,Code Completion,Code Generator, Autocomplete, Realtime Code Scanner, Generative AI and Code Search aClaude Opus 4.7, GPT-5.4, Gemini-3.1, Cursor AI, Copilot, Codex,Cline and ChatGPT, AI Copilot, AI Agents and Debugger, Code Assistants, Code Chat, Code Generator, Code Completion, Generative AI, Autoc

git-aware commit message generation from staged changes2 artifacts

Analyzes staged or modified code changes in the current Git repository and generates descriptive commit messages using the configured AI provider. The feature integrates with VS Code's Git context to identify changed files and diffs, then sends this information to the AI model to produce commit messages following conventional commit formats or project-specific conventions. This automation reduces the cognitive load of writing commit messages while maintaining code quality and repository history clarity.

twinny - AI Code Completion and ChatDevChat

freemium pricing model with free tier and premium features2 artifacts

Offers a freemium pricing structure where basic problem detection and explanations are available for free, with premium features (likely advanced fix generation, priority support, or higher API quotas) available through paid subscription. The free tier includes GNN-based problem detection and LLM-powered explanations using Metabob's default backend, while premium tiers likely unlock OpenAI ChatGPT integration, higher analysis quotas, or team features. Pricing details are not publicly documented in the marketplace listing.

Mintlify Doc Writer for Python, JavaScript, TypeScript, C++, PHP, Java, C#, Ruby & moreMetabob: Debug and Refactor with AI

Browse Other Types

Agents

Autonomous AI systems that act on your behalf

Models

Foundation models, fine-tunes, and specialized AI models

MCP Servers

Model Context Protocol tools and integrations

Repositories

Open-source AI projects on GitHub

APIs

Programmatic endpoints for AI capabilities

Extensions

Browser and IDE extensions powered by AI

View all 14 types →

Search the match graph →Submit an artifact