AI Datasets
The data that powers AI — training datasets, evaluation benchmarks, fine-tuning data, and domain-specific corpora hosted on Hugging Face and beyond.
Multilingual code evaluation across 17 languages.
44K pronoun resolution problems testing commonsense understanding.
Allen AI's safety classification dataset and model.
1M+ real user-AI conversations with demographic metadata.
108K images with dense scene graphs and 5.4M region descriptions.
64K preference dataset for RLHF training.
200K high-quality multi-turn dialogues for instruction tuning.
817 adversarial questions measuring model truthfulness vs misconceptions.
95K trivia questions requiring cross-document reasoning.
Microsoft's dataset for implicit toxicity detection.
67 TB permissively licensed code dataset across 600+ languages.
EleutherAI's 825 GiB diverse training dataset from 22 sources.
45K questions requiring reading text in images.
250GB curated code dataset for StarCoder training.
783 GB curated code dataset from 86 languages with PII redaction.
Stanford's 52K GPT-3.5-generated instruction dataset that started it all.
150K reading comprehension questions including unanswerable ones.
1.2M image-text pairs with GPT-4V captions.
Real ChatGPT conversations used to train Vicuna.
11K safety evaluation questions across 7 categories.
BigScience's curated multilingual dataset for BLOOM.
30 trillion token web dataset with 40+ quality signals per document.
100K prompts for evaluating toxic text generation.
Biomedical QA from PubMed abstracts testing evidence-based reasoning.
Massive parallel corpus for machine translation.
161K human-written messages in 35 languages with quality ratings.
183K multi-turn preference comparisons for alignment.
307K real Google Search queries answered from Wikipedia.
330K images with object detection, segmentation, and captions.
57-subject benchmark, the standard metric for comparing LLMs.
12.7K USMLE medical exam questions for clinical AI evaluation.
Multilingual web corpus covering 101 languages.
974 basic Python problems complementing HumanEval for code evaluation.
Enhanced Python coding benchmark with rigorous testing.
12.5K competition math problems across 7 subjects and 5 difficulty levels.
300K instructions extracted directly from aligned LLM outputs.
150K visual instruction examples for multimodal model training.
5.85 billion image-text pairs foundational for image generation.
14M images in 21K categories, the benchmark that launched deep learning.
113K questions requiring multi-hop reasoning across Wikipedia articles.
70K commonsense reasoning questions with adversarial distractors.
Google's 1,836-task instruction mixture for broad generalization.
8.3K financial reasoning questions over real S&P 500 earnings reports.
Hugging Face's 15T token dataset, new standard for LLM training.
1,000 data science problems across 7 Python libraries.
Allen AI's 3T token dataset for fully reproducible LLM training.
6.3T token multilingual dataset across 167 languages.
Largest open web crawl archive, the foundation of most LLM training corpora.
6M functions across 6 languages paired with documentation.
13K competitive programming problems from AlphaCode research.
Multi-turn conversation dataset for steerable models.
Google's cleaned Common Crawl corpus used to train T5.
23 hardest BIG-Bench tasks where models initially failed.
7.8K science questions testing genuine reasoning, not just recall.
10K coding problems across 3 difficulty levels with test suites.
Dataset by huggingface. 2,444,926 downloads.
Dataset by banned-historical-archives. 1,746,771 downloads.
Dataset by Salesforce. 1,211,500 downloads.
Dataset by ropedia-ai. 1,456,180 downloads.
Dataset by mrmrx. 1,202,174 downloads.
Dataset by xlangai. 1,037,848 downloads.
Dataset by nvidia. 1,017,553 downloads.
Dataset by mteb. 1,039,913 downloads.
Dataset by mlfoundations. 1,034,415 downloads.
Dataset by ayuo. 1,053,941 downloads.
Dataset by princeton-nlp. 678,148 downloads.
Dataset by openai. 822,680 downloads.
Dataset by NTU-NLP-sg. 696,087 downloads.
Dataset by mlfoundations. 796,577 downloads.
Dataset by mlfoundations. 857,357 downloads.
Dataset by allenai. 698,456 downloads.
Dataset by ryanmarten. 533,474 downloads.
Dataset by mlfoundations. 633,111 downloads.
Dataset by mlfoundations. 572,108 downloads.
Dataset by mlfoundations. 539,406 downloads.
Dataset by m-a-p. 555,725 downloads.
Dataset by LLM360. 490,092 downloads.
Dataset by lavita. 525,534 downloads.
Dataset by Kthera. 582,735 downloads.
Dataset by HuggingFaceFW. 637,939 downloads.
Dataset by fineinstructions. 546,949 downloads.
Dataset by Emmyc2. 549,575 downloads.
Dataset by allenai. 531,090 downloads.
Dataset by Yarina. 413,291 downloads.
Dataset by world-igr-plum. 392,732 downloads.
Dataset by rtrm. 415,242 downloads.
Dataset by nyu-mll. 394,564 downloads.
Dataset by Maynor996. 380,160 downloads.
Dataset by HuggingFaceFW. 382,017 downloads.
Dataset by Helsinki-NLP. 384,377 downloads.
Dataset by cais. 439,045 downloads.
Dataset by allenai. 406,798 downloads.
Dataset by siril-spcc. 299,750 downloads.
Dataset by Rowan. 302,975 downloads.
Dataset by robbyant. 274,791 downloads.
Dataset by nvidia. 334,635 downloads.
Dataset by merve. 318,615 downloads.
Top Capabilities
Analyzes selected code or entire files and generates natural language explanations of what the code does, how it works, and why certain patterns were chosen. The feature can produce documentation in multiple formats (docstrings, comments, markdown) and supports various documentation styles (JSDoc, Sphinx, etc.). Developers can request explanations at different levels of detail (high-level overview, line-by-line breakdown, architectural context) through the chat interface, with responses appearing as formatted text or code comments.
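The multiple-detail-level behavior described above amounts to selecting a prompt template before the chat request is sent. A minimal sketch, assuming hypothetical template text and a hypothetical `explanation_request` helper (not the extension's actual prompts):

```python
# Illustrative prompt templates, one per detail level; the real
# extension's prompt wording is not documented here.
EXPLANATION_PROMPTS = {
    "overview": "Summarize what this code does in 2-3 sentences:\n{code}",
    "line_by_line": "Explain this code line by line:\n{code}",
    "architecture": "Describe how this code fits into a larger system:\n{code}",
}

def explanation_request(code: str, level: str = "overview") -> str:
    """Pick the template for the requested detail level and fill in the code."""
    return EXPLANATION_PROMPTS[level].format(code=code)

req = explanation_request("def f(x): return x + 1", level="line_by_line")
```

The same user selection can thus yield an overview, a line-by-line breakdown, or architectural context purely by swapping the template.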
Translates non-English speech directly to English text using the same Transformer encoder-decoder architecture by prepending a 'translate' task token during decoding, bypassing explicit transcription. The AudioEncoder processes mel spectrograms identically to transcription, but the TextDecoder generates English tokens directly from audio embeddings. This end-to-end approach avoids cascading errors from intermediate transcription-then-translation pipelines and enables language-agnostic audio understanding.
Detects the spoken language in audio by analyzing the AudioEncoder embeddings and using the TextDecoder to predict a language token before generating transcription text. Language detection is implicit in the multitask training; the model learns to identify language from acoustic features without a separate classification head. Supports 99 languages with varying confidence based on training data representation (English: 65% of training data, others: 0.1-2%).
Maintains conversation history within a single chat session, allowing developers to ask follow-up questions, request refinements, and build on previous responses without re-providing context. The extension manages conversation state (messages, responses, context) and sends the full conversation history to ChatGPT's API with each request, enabling contextual understanding of refinement requests like 'make it faster' or 'add error handling'.
Generates new code snippets based on natural language descriptions by sending the user's intent and current editor selection context to OpenAI's API, then inserting the generated code at the cursor position or displaying it in the sidebar. The extension reads the active editor's selected text to provide code context, enabling the model to generate syntactically appropriate code for the detected language. Generation is triggered via keyboard shortcut (Ctrl+Alt+G), command palette, or toolbar button.
Generates docstrings, comments, and API documentation for functions, classes, and modules by analyzing code structure and semantics using GPT-4o. The extension detects function signatures, parameter types, and return types, then generates documentation in multiple formats (JSDoc, Python docstrings, Javadoc, etc.) matching the language and project conventions. Generated docs are inserted inline with proper indentation and formatting.
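The signature-detection step described above can be illustrated with Python's standard `ast` module: parse the function, read its parameter and return annotations, and emit a docstring skeleton for the model (or a template) to fill in. A minimal sketch assuming a hypothetical `docstring_stub` helper; requires Python 3.9+ for `ast.unparse`:

```python
import ast

def docstring_stub(source: str) -> str:
    """Parse a function definition and emit a docstring skeleton from its
    signature -- a stand-in for the signature-detection step that would
    precede the LLM call."""
    fn = ast.parse(source).body[0]
    assert isinstance(fn, ast.FunctionDef)
    lines = [f'"""{fn.name}.', "", "Args:"]
    for arg in fn.args.args:
        hint = ast.unparse(arg.annotation) if arg.annotation else "Any"
        lines.append(f"    {arg.arg} ({hint}): TODO")
    ret = ast.unparse(fn.returns) if fn.returns else "None"
    lines += ["", "Returns:", f"    {ret}: TODO", '"""']
    return "\n".join(lines)

stub = docstring_stub(
    "def scale(x: float, factor: float = 2.0) -> float:\n    return x * factor"
)
```

For other languages the same role is played by a language server or tree-sitter grammar rather than `ast`.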
Analyzes staged or modified code changes in the current Git repository and generates descriptive commit messages using the configured AI provider. The feature integrates with VS Code's Git context to identify changed files and diffs, then sends this information to the AI model to produce commit messages following conventional commit formats or project-specific conventions. This automation reduces the cognitive load of writing commit messages while maintaining code quality and repository history clarity.
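The prompt-building half of the commit-message flow above can be sketched without the Git integration itself: take the changed file list and diff text (which the extension collects from VS Code's Git context) and format them into a request for a Conventional Commits message. The `commit_message_prompt` helper and its wording are hypothetical:

```python
def commit_message_prompt(changed_files: list[str], diff: str) -> str:
    """Build the prompt sent to the AI provider; collecting the staged
    diff from the repository is stubbed out here."""
    file_list = "\n".join(f"- {path}" for path in changed_files)
    return (
        "Write a commit message in Conventional Commits format "
        "(type(scope): summary) for the following change.\n\n"
        f"Changed files:\n{file_list}\n\nDiff:\n{diff}"
    )

prompt = commit_message_prompt(
    ["src/parser.py"],
    "@@ -10,3 +10,4 @@\n+    if token is None:\n+        raise ValueError",
)
```

Project-specific conventions can be supported by swapping the instruction text, e.g. appending recent commit subjects as style examples.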
Offers a freemium pricing structure: basic problem detection and explanations are free, while premium features (likely advanced fix generation, priority support, or higher API quotas) require a paid subscription. The free tier includes GNN-based problem detection and LLM-powered explanations using Metabob's default backend; paid tiers likely unlock OpenAI ChatGPT integration, higher analysis quotas, or team features. Pricing details are not publicly documented in the marketplace listing.