AI Datasets
The data that powers AI — training datasets, evaluation benchmarks, fine-tuning data, and domain-specific corpora hosted on Hugging Face and beyond.
67 TB permissively licensed code dataset across 600+ languages.
EleutherAI's 825 GiB diverse training dataset from 22 sources.
30 trillion token web dataset with 40+ quality signals per document.
330K images with object detection, segmentation, and captions.
5.85 billion image-text pairs foundational for image generation.
6.3T token multilingual dataset across 167 languages.
23 hardest BIG-Bench tasks where models initially failed.
95K trivia questions requiring cross-document reasoning.
Microsoft's dataset for implicit toxicity detection.
250 GB curated code dataset for StarCoder training.
783 GB curated code dataset from 86 languages with PII redaction.
150K reading comprehension questions including unanswerable ones.
1.2M image-text pairs with GPT-4V captions.
Biomedical QA from PubMed abstracts testing evidence-based reasoning.
Massive parallel corpus for machine translation.
183K multi-turn preference comparisons for alignment.
307K real Google Search queries answered from Wikipedia.
Multilingual web corpus covering 101 languages.
974 basic Python problems complementing HumanEval for code evaluation.
14M images in 21K categories, the benchmark that helped launch the deep learning era.
113K questions requiring multi-hop reasoning across Wikipedia articles.
8.3K financial reasoning questions over real S&P 500 earnings reports.
Hugging Face's 15T token dataset, new standard for LLM training.
1,000 data science problems across 7 Python libraries.
Allen AI's 3T token dataset for fully reproducible LLM training.
Largest open web crawl archive, the source of most LLM pretraining web data.
6M functions across 6 languages paired with documentation.
13K competitive programming problems from AlphaCode research.
Multi-turn conversation dataset for steerable models.
7.8K science questions testing genuine reasoning, not just recall.
10K coding problems across 3 difficulty levels with test suites.
44K pronoun resolution problems testing commonsense understanding.
Allen AI's safety classification dataset and model.
1M+ real user-AI conversations with demographic metadata.
64K preference dataset for RLHF training.
200K high-quality multi-turn dialogues for instruction tuning.
817 adversarial questions measuring model truthfulness vs misconceptions.
45K questions requiring reading text in images.
Real ChatGPT conversations used to train Vicuna.
BigScience's curated multilingual dataset for BLOOM.
100K prompts for evaluating toxic text generation.
161K human-written messages in 35 languages with quality ratings.
12.7K USMLE medical exam questions for clinical AI evaluation.
12.5K competition math problems across 7 subjects and 5 difficulty levels.
300K instructions extracted directly from aligned LLM outputs.
150K visual instruction examples for multimodal model training.
70K commonsense reasoning questions with adversarial distractors.
AI annotation platform with medical imaging support.
Google's cleaned Common Crawl corpus used to train T5.
Stanford's 52K GPT-3.5-generated instruction dataset that helped kick off open instruction tuning.
Real-world visual QA requiring spatial reasoning.
8.5K grade school math problems with verifiable solutions, a benchmark for multi-step reasoning.
Google's 1,836-task instruction mixture for broad generalization.
LLM eval and monitoring with hallucination detection.
AI-assisted annotation with auto-labeling for vision.
Access and customize high-quality datasets easily; ideal for AI, research, market...
Truthfulness evaluation: can models answer factually?
Competition mathematics problems (harder than GSM8K)
Commonsense NLI with adversarial context mining
Commonsense reasoning with pronoun resolution
Visual Question Answering with real images and human questions
Mostly Basic Programming Problems (beginner-friendly code)
Grade school math problems requiring multi-step reasoning
Discrete reasoning over paragraphs (numerical reasoning)
Intelligence Aeternum — AI training dataset marketplace with 100,000+ museum artwork images with 4K token .json metadata. Search, preview, and purchase curated art datasets with provenance tracking. Powered by x402 USDC micropayments.
TalkToTables is a database translation and querying tool that utilizes the Chinook dataset available on...
I spent years building a 103B-token Usenet corpus (1980–2013) and finally documented it
Hugging Face's community-driven open-source library of datasets
Dataset by Rowan. 302,991 downloads.
Dataset by ropedia-ai. 1,456,180 downloads.
Dataset by nyu-mll. 397,160 downloads.
Dataset by mlfoundations. 1,034,415 downloads.
Dataset by mlfoundations. 796,577 downloads.
Dataset by mlfoundations. 857,357 downloads.
Dataset by mlfoundations. 633,111 downloads.
Dataset by mlfoundations. 572,108 downloads.
Dataset by mlfoundations. 539,406 downloads.
Dataset by robbyant. 388,267 downloads.
Dataset by nvidia. 355,146 downloads.
Dataset by NTU-NLP-sg. 665,024 downloads.
Dataset by mrmrx. 1,196,921 downloads.
Dataset by merve. 277,478 downloads.
Dataset by Maynor996. 662,770 downloads.
Dataset by lavita. 555,826 downloads.
Dataset by HuggingFaceFW. 414,812 downloads.
Dataset by HuggingFaceFW. 643,166 downloads.
Dataset by HuggingFaceFW. 474,259 downloads.
Dataset by huggingface. 2,531,937 downloads.
Dataset by Helsinki-NLP. 348,667 downloads.
Dataset by cais. 476,392 downloads.
Dataset by cadene. 311,762 downloads.
Dataset by bigcode. 430,889 downloads.
Dataset by banned-historical-archives. 1,846,708 downloads.
Dataset by allenai. 761,810 downloads.
Dataset by allenai. 425,151 downloads.
Dataset by Yarina. 413,511 downloads.
Dataset by Salesforce. 1,288,015 downloads.
Dataset by ryanmarten. 599,055 downloads.
Dataset by rtrm. 331,078 downloads.
Top Capabilities
Analyzes selected code or entire files and generates natural language explanations of what the code does, how it works, and why certain patterns were chosen. The feature can produce documentation in multiple formats (docstrings, comments, markdown) and supports various documentation styles (JSDoc, Sphinx, etc.). Developers can request explanations at different levels of detail (high-level overview, line-by-line breakdown, architectural context) through the chat interface, with responses appearing as formatted text or code comments.
Cody utilizes a context-aware engine that analyzes the current file and project structure to provide relevant code completions. It integrates with the Visual Studio Code API to access the Abstract Syntax Tree (AST) of the code, allowing it to suggest completions that are semantically relevant to the context, rather than relying solely on keyword matching. This approach ensures that the suggestions are not only syntactically correct but also contextually appropriate, enhancing developer productivity.
Converts natural language prompts into executable full-stack web applications by invoking an AI agent that generates React/Next.js frontend code, Node.js backend logic, and database schemas. The agent runs code in-browser via WebContainers to validate syntax and functionality before deployment, iterating on the generated code based on execution feedback. Token consumption scales with project complexity (larger codebases consume more tokens per iteration), and the agent supports design system imports from Figma and GitHub to accelerate UI generation.
Provides six model variants (tiny, base, small, medium, large, turbo) with parameter counts ranging from 39M to 1550M, enabling developers to choose optimal speed-accuracy tradeoffs. Tiny model runs at ~10x speed with 1GB VRAM; large model runs at 1x speed with 10GB VRAM. English-only variants (tiny.en, base.en, small.en) provide higher English accuracy by removing multilingual capacity. Turbo model (809M params) offers 8x speedup over large with minimal accuracy loss but lacks translation support.
Translates non-English speech directly to English text by using a task-specific token in the TextDecoder that signals translation mode, bypassing the need for intermediate transcription-then-translation pipelines. The AudioEncoder processes mel spectrograms identically to transcription, but the decoder generates English tokens directly from audio embeddings, reducing latency and error propagation compared to cascaded systems.
Transcribes audio in 98 languages to text in the original language using a unified Transformer sequence-to-sequence architecture with a shared AudioEncoder that processes mel spectrograms into language-agnostic embeddings, then a TextDecoder that generates tokens autoregressively. The system handles variable-length audio by padding or trimming to 30-second segments and uses task-specific tokens to signal transcription mode, enabling a single model to handle multiple languages without language-specific branches.
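The pad-or-trim step mentioned above can be illustrated with a minimal sketch. It assumes 16 kHz mono audio held in a plain Python list, so a 30-second window is 480,000 samples; the sample rate, the function name pad_or_trim, and padding with zeros are illustrative assumptions rather than details stated here.

```python
# Illustrative sketch only: force audio to a fixed 30-second window.
SAMPLE_RATE = 16_000   # assumption: 16 kHz mono input
WINDOW_SECONDS = 30    # segment length stated in the description
WINDOW_SAMPLES = SAMPLE_RATE * WINDOW_SECONDS


def pad_or_trim(samples, length=WINDOW_SAMPLES):
    """Return exactly `length` samples: cut the excess or append zeros (silence)."""
    if len(samples) >= length:
        return list(samples[:length])
    return list(samples) + [0.0] * (length - len(samples))


short_clip = [0.1] * (SAMPLE_RATE * 5)   # 5 seconds of audio
long_clip = [0.1] * (SAMPLE_RATE * 45)   # 45 seconds of audio
print(len(pad_or_trim(short_clip)), len(pad_or_trim(long_clip)))
```

Both clips come out the same length, which is what lets a single encoder process inputs of any duration one window at a time.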
Detects the spoken language in audio by processing mel spectrograms through the AudioEncoder and using a language classification head that outputs probability distributions over 98 supported languages. The model leverages 680K hours of multilingual training data to recognize language characteristics from acoustic features alone, without requiring transcription. Language detection occurs as a preliminary step in the transcription pipeline and can be called independently via the language detection task token.
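As a rough sketch of how a probability distribution over languages might be turned into a single prediction, the snippet below applies a softmax to per-language scores and picks the most likely one. The scores, the three language codes, and the function name language_probabilities are hypothetical; a real model would produce one score per supported language.

```python
import math


def language_probabilities(logits):
    """Softmax over a {language_code: score} mapping."""
    peak = max(logits.values())  # subtract the max for numerical stability
    exps = {lang: math.exp(score - peak) for lang, score in logits.items()}
    total = sum(exps.values())
    return {lang: value / total for lang, value in exps.items()}


# Hypothetical scores for three languages (not real model output)
logits = {"en": 2.0, "de": 0.5, "ja": -1.0}
probs = language_probabilities(logits)
best = max(probs, key=probs.get)
print(best, round(probs[best], 3))
```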
W&B Personal tier (free) and Enterprise tier support self-hosted deployment via Docker, enabling on-premise installation for teams with data residency or security requirements. Self-hosted instances run independently from W&B cloud, with optional integration to W&B cloud for cross-instance features. Supports custom domain configuration, HTTPS, and integration with corporate identity providers (LDAP, SAML, OAuth).