Natural Questions
Dataset · Free · 307K real Google Search queries answered from Wikipedia.
Capabilities (8 decomposed)
open-domain question answering evaluation with retrieval + comprehension
Medium confidence. Evaluates QA systems on a two-stage pipeline: first retrieving relevant Wikipedia passages from 5.9M articles, then extracting answers from those passages. Unlike single-stage QA benchmarks, Natural Questions forces models to solve both information retrieval (finding the right document/passage) and reading comprehension (extracting the answer) in sequence, measuring end-to-end open-domain QA performance with 307,373 real Google Search queries paired with gold Wikipedia articles and human-annotated answers.
Uniquely combines information retrieval and reading comprehension evaluation in a single benchmark by requiring systems to first retrieve relevant passages from 5.9M Wikipedia articles, then extract answers — forcing end-to-end evaluation of both components rather than isolated QA on pre-selected passages like SQuAD
More realistic than SQuAD (requires passage retrieval) and more scalable than MS MARCO (Wikipedia corpus is cleaner and more structured than web documents), making it the standard for evaluating production RAG systems
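The two-stage flow described above can be sketched with a toy corpus and a deliberately naive retriever and reader. Everything here (the corpus, the scoring, the year-extraction "reader") is illustrative, not the official NQ tooling:

```python
from collections import Counter

# Toy stand-in for the 5.9M-article Wikipedia index (hypothetical data).
CORPUS = {
    "Eiffel Tower": "The Eiffel Tower is a wrought-iron lattice tower in Paris completed in 1889.",
    "Statue of Liberty": "The Statue of Liberty was dedicated in New York Harbor in 1886.",
}

def tokenize(text):
    return text.lower().replace(".", "").split()

def retrieve(question, corpus, k=1):
    """Stage 1: rank passages by simple term overlap with the question."""
    q_terms = Counter(tokenize(question))
    scored = []
    for title, passage in corpus.items():
        p_terms = Counter(tokenize(passage))
        score = sum(min(q_terms[t], p_terms[t]) for t in q_terms)
        scored.append((score, title, passage))
    scored.sort(reverse=True)
    return scored[:k]

def extract(question, passage):
    """Stage 2: naive reader that returns the first 4-digit year, if any."""
    for tok in passage.split():
        word = tok.strip(".")
        if word.isdigit() and len(word) == 4:
            return word
    return None  # unanswerable under this toy reader

question = "when was the eiffel tower completed"
_, title, passage = retrieve(question, CORPUS, k=1)[0]
print(title, extract(question, passage))  # Eiffel Tower 1889
```

A production system would swap stage 1 for a BM25 or dense retriever and stage 2 for a trained reader model; the point is that NQ scores the two stages as one pipeline.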
real-world query distribution sampling from google search logs
Medium confidence. Dataset contains 307,373 naturally-occurring questions extracted from anonymized Google Search query logs, preserving the distribution and phrasing of actual user information needs rather than synthetic or crowdsourced questions. Questions span diverse topics, question types (factual, definitional, numerical), and difficulty levels, with natural language variation (typos, fragments, colloquialisms) that synthetic datasets cannot capture. This grounds evaluation in real user behavior and search intent patterns.
Sourced directly from anonymized Google Search logs rather than crowdsourced or synthetic generation, preserving natural question phrasing, ambiguity, and the actual distribution of user information needs at scale
More representative of production search behavior than crowdsourced QA datasets (which exhibit annotation artifacts and unnatural phrasing), and more diverse than templated benchmarks
dual-level answer annotation with long and short answer extraction
Medium confidence. Each question is annotated with two complementary answer types: long answers (paragraph-level passages from Wikipedia, marked with start/end character offsets) and short answers (entity-level spans, marked with token indices). Annotators identify both levels from the same Wikipedia article, or mark the question as unanswerable if no answer exists. This dual annotation enables evaluation of both passage-level retrieval quality (can the system find the right paragraph?) and fine-grained answer extraction (can it identify the exact entity or phrase?).
Provides dual-level annotations (paragraph + entity) enabling independent evaluation of retrieval quality and extraction precision, rather than single-level annotations that conflate both stages
More granular than SQuAD (which only provides short answer spans) and more realistic than synthetic QA pairs, allowing separate measurement of retrieval and extraction components
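Reading dual-level annotations might look like the sketch below. The field names mirror the token-offset span shape of the public NQ jsonl schema, but the record itself is a toy and the exact schema should be checked against the official data release:

```python
# Simplified, hypothetical NQ-style record: real NQ jsonl stores spans as
# token offsets into the tokenized document; this toy record mirrors that.
record = {
    "question_text": "who wrote the origin of species",
    "document_tokens": ("On the Origin of Species was written by "
                        "Charles Darwin and published in 1859 .").split(),
    "annotations": [{
        "long_answer": {"start_token": 0, "end_token": 14},     # paragraph span
        "short_answers": [{"start_token": 8, "end_token": 10}],  # entity span
    }],
}

def span_text(tokens, span):
    """Materialize a [start_token, end_token) span as text."""
    return " ".join(tokens[span["start_token"]:span["end_token"]])

ann = record["annotations"][0]
long_ans = span_text(record["document_tokens"], ann["long_answer"])
short_ans = span_text(record["document_tokens"], ann["short_answers"][0])
print(short_ans)  # Charles Darwin
```

The long span scores passage retrieval; the short span scores extraction, which is what lets the two stages be evaluated separately.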
answerability classification with unanswerable question handling
Medium confidence. Annotators explicitly label each question as answerable or unanswerable based on whether a valid answer exists in the paired Wikipedia article. Unanswerable questions are not simply omitted — they are included in the benchmark with explicit labels, forcing QA systems to learn to recognize when no answer exists rather than always attempting extraction. This tests a critical capability for production systems: rejecting questions outside the knowledge base rather than hallucinating answers.
Explicitly includes unanswerable questions with labels rather than filtering them out, forcing systems to learn rejection as a valid output rather than always attempting answer extraction
More realistic than QA benchmarks that only include answerable questions, and directly addresses the hallucination problem that production systems face
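A minimal answerability check, in the spirit of the official evaluation (which, to our understanding, treats a dev/test question as answerable when at least two of five annotators marked a non-null answer). The helper, record shape, and threshold here are a sketch of that rule, not the released eval script:

```python
def is_answerable(annotations, threshold=2):
    """Count a question as answerable when at least `threshold` annotators
    gave a non-null long answer (a -1 start token marks "no answer")."""
    non_null = sum(1 for a in annotations
                   if a["long_answer"]["start_token"] != -1)
    return non_null >= threshold

NULL = {"long_answer": {"start_token": -1}}   # annotator found no answer
ANS = {"long_answer": {"start_token": 12}}    # annotator marked a span
print(is_answerable([ANS, ANS, NULL, NULL, NULL]))   # True
print(is_answerable([ANS, NULL, NULL, NULL, NULL]))  # False
```

A system's "no answer" prediction is then scored against this label, so abstaining becomes a first-class output rather than a failure mode.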
wikipedia corpus indexing and passage ranking evaluation
Medium confidence. Benchmark includes the full 5.9M Wikipedia article corpus (2018 snapshot) as the retrieval target, requiring systems to rank relevant passages above irrelevant ones. Evaluation measures retrieval performance independently of answer extraction — systems are scored on whether they retrieve the correct Wikipedia article and passage before attempting to extract the answer. This decouples retrieval quality from extraction quality, enabling diagnosis of pipeline failures.
Provides a large-scale open-domain retrieval benchmark with 5.9M Wikipedia articles and real user queries, enabling evaluation of dense retrieval methods on realistic scale and diversity
Larger and more realistic than MS MARCO (which uses web documents) and more structured than web-scale retrieval benchmarks, making it ideal for evaluating dense retrievers
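The retrieval-stage metrics named above (recall@k, MRR) are straightforward to compute per query; `ranked_ids` and `gold_id` below are hypothetical passage identifiers:

```python
def recall_at_k(ranked_ids, gold_id, k):
    """1 if the gold passage appears in the top-k retrieved ids, else 0."""
    return int(gold_id in ranked_ids[:k])

def mrr(ranked_ids, gold_id):
    """Reciprocal rank of the gold passage; 0 if it was not retrieved."""
    for rank, pid in enumerate(ranked_ids, start=1):
        if pid == gold_id:
            return 1.0 / rank
    return 0.0

ranked = ["doc9", "doc3", "doc1"]  # retriever output, best first
print(recall_at_k(ranked, "doc3", 1), recall_at_k(ranked, "doc3", 5))  # 0 1
print(mrr(ranked, "doc3"))  # 0.5
```

Averaging these over all queries gives the corpus-level retrieval score, independent of how well the reader extracts answers.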
multi-annotator agreement and answer quality assessment
Medium confidence. Multiple annotators independently annotate each question with long and short answers, enabling measurement of inter-annotator agreement (IAA) and identification of ambiguous or difficult questions. Benchmark includes agreement metrics (e.g., F1 agreement between annotators) for each question, allowing researchers to filter by agreement level or analyze systematic disagreement patterns. This provides insight into question difficulty and annotation quality.
Includes explicit inter-annotator agreement metrics for each question, enabling researchers to understand benchmark reliability and filter by agreement level
More transparent about annotation quality than benchmarks that hide disagreement, allowing researchers to make informed decisions about evaluation methodology
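Pairwise token-level F1 between annotators, as described above, might be computed like this. This is a sketch of the idea, not the benchmark's released agreement code:

```python
from collections import Counter
from itertools import combinations

def token_f1(pred_tokens, gold_tokens):
    """Bag-of-tokens F1 between two answer spans."""
    common = Counter(pred_tokens) & Counter(gold_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    p = overlap / len(pred_tokens)
    r = overlap / len(gold_tokens)
    return 2 * p * r / (p + r)

def pairwise_agreement(annotator_spans):
    """Mean token F1 over all annotator pairs for one question."""
    pairs = list(combinations(annotator_spans, 2))
    return sum(token_f1(a, b) for a, b in pairs) / len(pairs)

spans = [["charles", "darwin"], ["charles", "darwin"], ["darwin"]]
print(round(pairwise_agreement(spans), 3))  # 0.778
```

Questions with low pairwise agreement are natural candidates for filtering or separate error analysis.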
hierarchical evaluation metrics for retrieval and extraction stages
Medium confidence. Benchmark enables computation of separate evaluation metrics for retrieval and extraction stages: retrieval metrics (recall@k, MRR) measure whether the correct Wikipedia article is ranked highly, while extraction metrics (F1, exact match) measure whether the answer span is correctly identified. Pipeline metrics (end-to-end F1) measure overall QA performance. This modular evaluation approach allows diagnosis of failures at each stage and comparison of different architectural choices.
Enables separate evaluation of retrieval and extraction stages, allowing researchers to measure stage-specific performance and diagnose pipeline bottlenecks
More diagnostic than end-to-end QA metrics alone, and more realistic than isolated retrieval or extraction benchmarks
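The extraction-stage exact-match metric can be sketched as follows. The normalization mirrors the common SQuAD-style convention (lowercase, drop punctuation and articles), which NQ-style evaluations typically adopt; treat the exact rules as an assumption to verify against the official eval script:

```python
import re
import string

def normalize(text):
    """SQuAD-style answer normalization: lowercase, strip punctuation
    and the articles a/an/the, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction, gold_answers):
    """1 if the normalized prediction matches any gold answer, else 0."""
    return int(any(normalize(prediction) == normalize(g) for g in gold_answers))

print(exact_match("The Eiffel Tower", ["Eiffel Tower"]))  # 1
print(exact_match("Eiffel", ["Eiffel Tower"]))            # 0
```

Token-level F1 over the same normalized strings gives partial credit; combining it with the retrieval metrics yields the end-to-end pipeline score.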
cross-domain generalization testing via wikipedia article diversity
Medium confidence. Natural Questions spans diverse Wikipedia article categories (science, history, biography, geography, etc.), enabling evaluation of QA system generalization across domains. Questions are paired with articles from different Wikipedia sections, testing whether systems can handle domain-specific terminology, article structures, and information patterns. This provides insight into cross-domain robustness beyond single-domain benchmarks.
Spans diverse Wikipedia domains and article types, enabling evaluation of cross-domain generalization rather than single-domain performance
More diverse than domain-specific QA benchmarks, and more realistic than synthetic benchmarks that don't reflect real Wikipedia article distribution
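A per-domain breakdown like the one described might be computed as below. The categories and results are hypothetical, and NQ records do not ship with a category field, so in practice it would have to be derived from article metadata:

```python
from collections import defaultdict

# Hypothetical per-example results: (wikipedia_category, correct?)
results = [
    ("science", 1), ("science", 0), ("history", 1),
    ("history", 1), ("geography", 0),
]

def per_domain_accuracy(results):
    """Aggregate accuracy per category to spot weak domains."""
    totals = defaultdict(lambda: [0, 0])  # category -> [correct, count]
    for category, correct in results:
        totals[category][0] += correct
        totals[category][1] += 1
    return {c: correct / n for c, (correct, n) in totals.items()}

print(per_domain_accuracy(results))
# {'science': 0.5, 'history': 1.0, 'geography': 0.0}
```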
Capabilities are decomposed by AI analysis. Each maps to specific user intents and improves with match feedback.
Related Artifacts (sharing capabilities)
Artifacts that share capabilities with Natural Questions, ranked by overlap. Discovered automatically through the match graph.
ai2_arc
Dataset by allenai. 425,151 downloads.
privateGPT
Ask questions to your documents without an internet connection, using the power of LLMs.
Llama-3.2-1B-Instruct
Text-generation model. 6,171,370 downloads.
gaia
Dataset by siril-spcc. 336,780 downloads.
Meta: Llama 3.1 70B Instruct
Meta's latest class of model (Llama 3.1) launched with a variety of sizes & flavors. This 70B instruct-tuned version is optimized for high-quality dialogue use cases. It has demonstrated strong...
Google: Gemma 4 26B A4B (free)
Gemma 4 26B A4B IT is an instruction-tuned Mixture-of-Experts (MoE) model from Google DeepMind. Despite 25.2B total parameters, only 3.8B activate per token during inference — delivering near-31B quality at...
Best For
- ✓Teams building production RAG systems and open-domain QA pipelines
- ✓Researchers evaluating dense retrieval methods (DPR, ColBERT, etc.) and reader models
- ✓ML engineers optimizing two-stage QA architectures with separate retrieval and extraction components
- ✓Search engine teams and IR researchers validating QA components against production query distributions
- ✓Builders of conversational search systems who need realistic question diversity
- ✓Teams using it as a base for multilingual QA evaluation (Natural Questions queries are English-only; multilingual variants are derived from it)
- ✓Researchers analyzing retrieval vs. extraction error modes independently
Known Limitations
- ⚠Requires implementing or integrating a retrieval component — benchmark does not provide pre-computed retrieval results, forcing teams to build/tune their own passage ranking
- ⚠Wikipedia-only corpus may not generalize to domain-specific QA tasks or closed-book settings
- ⚠Evaluation requires access to full Wikipedia dump (5.9M articles) for retrieval — significant computational overhead for baseline runs
- ⚠Long answer annotations are paragraph-level, not sentence-level, making fine-grained answer boundary evaluation difficult
- ⚠No temporal dimension — all questions and Wikipedia snapshots are from 2018, missing evolving information needs
- ⚠Anonymization removes user context and session history — single-turn questions only, no multi-turn dialogue
About
Google's question answering benchmark containing 307,373 real anonymized queries from Google Search paired with Wikipedia articles. Annotators identify both long answers (paragraph-level) and short answers (entity-level) from the Wikipedia page, or mark the question as unanswerable. Uniquely tests information retrieval + reading comprehension together since models must find relevant passages before extracting answers. The standard benchmark for open-domain QA and RAG system evaluation.