Natural Questions
Dataset · Free · 307K real Google Search queries answered from Wikipedia.
Capabilities (8 decomposed)
open-domain question answering evaluation with retrieval + comprehension
Medium confidence. Evaluates QA systems on a two-stage pipeline: first retrieving relevant Wikipedia passages from 5.9M articles, then extracting answers from those passages. Unlike single-stage QA benchmarks, Natural Questions forces models to solve both information retrieval (finding the right document/passage) and reading comprehension (extracting the answer) in sequence, measuring end-to-end open-domain QA performance with 307,373 real Google Search queries paired with gold Wikipedia articles and human-annotated answers.
Uniquely combines information retrieval and reading comprehension evaluation in a single benchmark by requiring systems to first retrieve relevant passages from 5.9M Wikipedia articles, then extract answers — forcing end-to-end evaluation of both components rather than isolated QA on pre-selected passages like SQuAD
More realistic than SQuAD (requires passage retrieval) and more scalable than MS MARCO (Wikipedia corpus is cleaner and more structured than web documents), making it the standard for evaluating production RAG systems
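The two-stage flow described above can be sketched with a toy corpus and a deliberately naive retriever and reader. Everything here (the corpus, the scoring, the year-extraction "reader") is illustrative, not the official NQ tooling:

```python
from collections import Counter

# Toy stand-in for the 5.9M-article Wikipedia index (hypothetical data).
CORPUS = {
    "Eiffel Tower": "The Eiffel Tower is a wrought-iron lattice tower in Paris completed in 1889.",
    "Statue of Liberty": "The Statue of Liberty was dedicated in New York Harbor in 1886.",
}

def tokenize(text):
    return text.lower().replace(".", "").split()

def retrieve(question, corpus, k=1):
    """Stage 1: rank passages by simple term overlap with the question."""
    q_terms = Counter(tokenize(question))
    scored = []
    for title, passage in corpus.items():
        p_terms = Counter(tokenize(passage))
        score = sum(min(q_terms[t], p_terms[t]) for t in q_terms)
        scored.append((score, title, passage))
    scored.sort(reverse=True)
    return scored[:k]

def extract(question, passage):
    """Stage 2: naive reader that returns the first 4-digit year, if any."""
    for tok in passage.split():
        word = tok.strip(".")
        if word.isdigit() and len(word) == 4:
            return word
    return None  # unanswerable under this toy reader

question = "when was the eiffel tower completed"
_, title, passage = retrieve(question, CORPUS, k=1)[0]
print(title, extract(question, passage))  # Eiffel Tower 1889
```

A production system would swap stage 1 for a BM25 or dense retriever and stage 2 for a trained reader model; the point is that NQ scores the two stages as one pipeline.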
real-world query distribution sampling from google search logs
Medium confidence. Dataset contains 307,373 naturally-occurring questions extracted from anonymized Google Search query logs, preserving the distribution and phrasing of actual user information needs rather than synthetic or crowdsourced questions. Questions span diverse topics, question types (factual, definitional, numerical), and difficulty levels, with natural language variation (typos, fragments, colloquialisms) that synthetic datasets cannot capture. This grounds evaluation in real user behavior and search intent patterns.
Sourced directly from anonymized Google Search logs rather than crowdsourced or synthetic generation, preserving natural question phrasing, ambiguity, and the actual distribution of user information needs at scale
More representative of production search behavior than crowdsourced QA datasets (which exhibit annotation artifacts and unnatural phrasing), and more diverse than templated benchmarks
dual-level answer annotation with long and short answer extraction
Medium confidence. Each question is annotated with two complementary answer types: long answers (paragraph-level passages from Wikipedia, marked with start/end character offsets) and short answers (entity-level spans, marked with token indices). Annotators identify both levels from the same Wikipedia article, or mark the question as unanswerable if no answer exists. This dual annotation enables evaluation of both passage-level retrieval quality (can the system find the right paragraph?) and fine-grained answer extraction (can it identify the exact entity or phrase?).
Provides dual-level annotations (paragraph + entity) enabling independent evaluation of retrieval quality and extraction precision, rather than single-level annotations that conflate both stages
More granular than SQuAD (which only provides short answer spans) and more realistic than synthetic QA pairs, allowing separate measurement of retrieval and extraction components
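Reading dual-level annotations might look like the sketch below. The field names mirror the token-offset span shape of the public NQ jsonl schema, but the record itself is a toy and the exact schema should be checked against the official data release:

```python
# Simplified, hypothetical NQ-style record: real NQ jsonl stores spans as
# token offsets into the tokenized document; this toy record mirrors that.
record = {
    "question_text": "who wrote the origin of species",
    "document_tokens": ("On the Origin of Species was written by "
                        "Charles Darwin and published in 1859 .").split(),
    "annotations": [{
        "long_answer": {"start_token": 0, "end_token": 14},     # paragraph span
        "short_answers": [{"start_token": 8, "end_token": 10}],  # entity span
    }],
}

def span_text(tokens, span):
    """Materialize a [start_token, end_token) span as text."""
    return " ".join(tokens[span["start_token"]:span["end_token"]])

ann = record["annotations"][0]
long_ans = span_text(record["document_tokens"], ann["long_answer"])
short_ans = span_text(record["document_tokens"], ann["short_answers"][0])
print(short_ans)  # Charles Darwin
```

The long span scores passage retrieval; the short span scores extraction, which is what lets the two stages be evaluated separately.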
answerability classification with unanswerable question handling
Medium confidence. Annotators explicitly label each question as answerable or unanswerable based on whether a valid answer exists in the paired Wikipedia article. Unanswerable questions are not simply omitted — they are included in the benchmark with explicit labels, forcing QA systems to learn to recognize when no answer exists rather than always attempting extraction. This tests a critical capability for production systems: rejecting questions outside the knowledge base rather than hallucinating answers.
Explicitly includes unanswerable questions with labels rather than filtering them out, forcing systems to learn rejection as a valid output rather than always attempting answer extraction
More realistic than QA benchmarks that only include answerable questions, and directly addresses the hallucination problem that production systems face
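A minimal answerability check, in the spirit of the official evaluation (which, to our understanding, treats a dev/test question as answerable when at least two of five annotators marked a non-null answer). The helper, record shape, and threshold here are a sketch of that rule, not the released eval script:

```python
def is_answerable(annotations, threshold=2):
    """Count a question as answerable when at least `threshold` annotators
    gave a non-null long answer (a -1 start token marks "no answer")."""
    non_null = sum(1 for a in annotations
                   if a["long_answer"]["start_token"] != -1)
    return non_null >= threshold

NULL = {"long_answer": {"start_token": -1}}   # annotator found no answer
ANS = {"long_answer": {"start_token": 12}}    # annotator marked a span
print(is_answerable([ANS, ANS, NULL, NULL, NULL]))   # True
print(is_answerable([ANS, NULL, NULL, NULL, NULL]))  # False
```

A system's "no answer" prediction is then scored against this label, so abstaining becomes a first-class output rather than a failure mode.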
wikipedia corpus indexing and passage ranking evaluation
Medium confidence. Benchmark includes the full 5.9M Wikipedia article corpus (2018 snapshot) as the retrieval target, requiring systems to rank relevant passages above irrelevant ones. Evaluation measures retrieval performance independently of answer extraction — systems are scored on whether they retrieve the correct Wikipedia article and passage before attempting to extract the answer. This decouples retrieval quality from extraction quality, enabling diagnosis of pipeline failures.
Provides a large-scale open-domain retrieval benchmark with 5.9M Wikipedia articles and real user queries, enabling evaluation of dense retrieval methods on realistic scale and diversity
Larger and more realistic than MS MARCO (which uses web documents) and more structured than web-scale retrieval benchmarks, making it ideal for evaluating dense retrievers
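The retrieval-stage metrics named above (recall@k, MRR) are straightforward to compute per query; `ranked_ids` and `gold_id` below are hypothetical passage identifiers:

```python
def recall_at_k(ranked_ids, gold_id, k):
    """1 if the gold passage appears in the top-k retrieved ids, else 0."""
    return int(gold_id in ranked_ids[:k])

def mrr(ranked_ids, gold_id):
    """Reciprocal rank of the gold passage; 0 if it was not retrieved."""
    for rank, pid in enumerate(ranked_ids, start=1):
        if pid == gold_id:
            return 1.0 / rank
    return 0.0

ranked = ["doc9", "doc3", "doc1"]  # retriever output, best first
print(recall_at_k(ranked, "doc3", 1), recall_at_k(ranked, "doc3", 5))  # 0 1
print(mrr(ranked, "doc3"))  # 0.5
```

Averaging these over all queries gives the corpus-level retrieval score, independent of how well the reader extracts answers.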
multi-annotator agreement and answer quality assessment
Medium confidence. Multiple annotators independently annotate each question with long and short answers, enabling measurement of inter-annotator agreement (IAA) and identification of ambiguous or difficult questions. Benchmark includes agreement metrics (e.g., F1 agreement between annotators) for each question, allowing researchers to filter by agreement level or analyze systematic disagreement patterns. This provides insight into question difficulty and annotation quality.
Includes explicit inter-annotator agreement metrics for each question, enabling researchers to understand benchmark reliability and filter by agreement level
More transparent about annotation quality than benchmarks that hide disagreement, allowing researchers to make informed decisions about evaluation methodology
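Pairwise token-level F1 between annotators, as described above, might be computed like this. This is a sketch of the idea, not the benchmark's released agreement code:

```python
from collections import Counter
from itertools import combinations

def token_f1(pred_tokens, gold_tokens):
    """Bag-of-tokens F1 between two answer spans."""
    common = Counter(pred_tokens) & Counter(gold_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    p = overlap / len(pred_tokens)
    r = overlap / len(gold_tokens)
    return 2 * p * r / (p + r)

def pairwise_agreement(annotator_spans):
    """Mean token F1 over all annotator pairs for one question."""
    pairs = list(combinations(annotator_spans, 2))
    return sum(token_f1(a, b) for a, b in pairs) / len(pairs)

spans = [["charles", "darwin"], ["charles", "darwin"], ["darwin"]]
print(round(pairwise_agreement(spans), 3))  # 0.778
```

Questions with low pairwise agreement are natural candidates for filtering or separate error analysis.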
hierarchical evaluation metrics for retrieval and extraction stages
Medium confidence. Benchmark enables computation of separate evaluation metrics for retrieval and extraction stages: retrieval metrics (recall@k, MRR) measure whether the correct Wikipedia article is ranked highly, while extraction metrics (F1, exact match) measure whether the answer span is correctly identified. Pipeline metrics (end-to-end F1) measure overall QA performance. This modular evaluation approach allows diagnosis of failures at each stage and comparison of different architectural choices.
Enables separate evaluation of retrieval and extraction stages, allowing researchers to measure stage-specific performance and diagnose pipeline bottlenecks
More diagnostic than end-to-end QA metrics alone, and more realistic than isolated retrieval or extraction benchmarks
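The extraction-stage exact-match metric can be sketched as follows. The normalization mirrors the common SQuAD-style convention (lowercase, drop punctuation and articles), which NQ-style evaluations typically adopt; treat the exact rules as an assumption to verify against the official eval script:

```python
import re
import string

def normalize(text):
    """SQuAD-style answer normalization: lowercase, strip punctuation
    and the articles a/an/the, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction, gold_answers):
    """1 if the normalized prediction matches any gold answer, else 0."""
    return int(any(normalize(prediction) == normalize(g) for g in gold_answers))

print(exact_match("The Eiffel Tower", ["Eiffel Tower"]))  # 1
print(exact_match("Eiffel", ["Eiffel Tower"]))            # 0
```

Token-level F1 over the same normalized strings gives partial credit; combining it with the retrieval metrics yields the end-to-end pipeline score.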
cross-domain generalization testing via wikipedia article diversity
Medium confidence. Natural Questions spans diverse Wikipedia article categories (science, history, biography, geography, etc.), enabling evaluation of QA system generalization across domains. Questions are paired with articles from different Wikipedia sections, testing whether systems can handle domain-specific terminology, article structures, and information patterns. This provides insight into cross-domain robustness beyond single-domain benchmarks.
Spans diverse Wikipedia domains and article types, enabling evaluation of cross-domain generalization rather than single-domain performance
More diverse than domain-specific QA benchmarks, and more realistic than synthetic benchmarks that don't reflect real Wikipedia article distribution
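A per-domain breakdown like the one described might be computed as below. The categories and results are hypothetical, and NQ records do not ship with a category field, so in practice it would have to be derived from article metadata:

```python
from collections import defaultdict

# Hypothetical per-example results: (wikipedia_category, correct?)
results = [
    ("science", 1), ("science", 0), ("history", 1),
    ("history", 1), ("geography", 0),
]

def per_domain_accuracy(results):
    """Aggregate accuracy per category to spot weak domains."""
    totals = defaultdict(lambda: [0, 0])  # category -> [correct, count]
    for category, correct in results:
        totals[category][0] += correct
        totals[category][1] += 1
    return {c: correct / n for c, (correct, n) in totals.items()}

print(per_domain_accuracy(results))
# {'science': 0.5, 'history': 1.0, 'geography': 0.0}
```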
Capabilities are decomposed by AI analysis. Each maps to specific user intents and improves with match feedback.
Related Artifacts (sharing capabilities)
Artifacts that share capabilities with Natural Questions, ranked by overlap. Discovered automatically through the match graph.
ai2_arc
Dataset by allenai. 425,151 downloads.
privateGPT
Ask questions to your documents without an internet connection, using the power of LLMs.
Llama-3.2-1B-Instruct
Text-generation model. 6,171,370 downloads.
gaia
Dataset by siril-spcc. 336,780 downloads.
Meta: Llama 3.1 70B Instruct
Meta's latest class of model (Llama 3.1) launched with a variety of sizes & flavors. This 70B instruct-tuned version is optimized for high-quality dialogue use cases. It has demonstrated strong...
Google: Gemma 4 26B A4B (free)
Gemma 4 26B A4B IT is an instruction-tuned Mixture-of-Experts (MoE) model from Google DeepMind. Despite 25.2B total parameters, only 3.8B activate per token during inference — delivering near-31B quality at...
Best For
- ✓Teams building production RAG systems and open-domain QA pipelines
- ✓Researchers evaluating dense retrieval methods (DPR, ColBERT, etc.) and reader models
- ✓ML engineers optimizing two-stage QA architectures with separate retrieval and extraction components
- ✓Search engine teams and IR researchers validating QA components against production query distributions
- ✓Builders of conversational search systems who need realistic question diversity
- ✓Teams using it as a base for multilingual QA evaluation (Natural Questions queries are English-only; multilingual variants are derived from it)
- ✓Researchers analyzing retrieval vs. extraction error modes independently
Known Limitations
- ⚠Requires implementing or integrating a retrieval component — benchmark does not provide pre-computed retrieval results, forcing teams to build/tune their own passage ranking
- ⚠Wikipedia-only corpus may not generalize to domain-specific QA tasks or closed-book settings
- ⚠Evaluation requires access to full Wikipedia dump (5.9M articles) for retrieval — significant computational overhead for baseline runs
- ⚠Long answer annotations are paragraph-level, not sentence-level, making fine-grained answer boundary evaluation difficult
- ⚠No temporal dimension — all questions and Wikipedia snapshots are from 2018, missing evolving information needs
- ⚠Anonymization removes user context and session history — single-turn questions only, no multi-turn dialogue
About
Google's question answering benchmark containing 307,373 real anonymized queries from Google Search paired with Wikipedia articles. Annotators identify both long answers (paragraph-level) and short answers (entity-level) from the Wikipedia page, or mark the question as unanswerable. Uniquely tests information retrieval + reading comprehension together since models must find relevant passages before extracting answers. The standard benchmark for open-domain QA and RAG system evaluation.