{"passport":{"unfragile":{"@version":"1.0","version":"2026-05","artifact":{"id":"natural-questions","slug":"natural-questions","name":"Natural Questions","type":"dataset","url":"https://ai.google.com/research/NaturalQuestions","page_url":"https://unfragile.ai/natural-questions","categories":["rag-knowledge","testing-quality"],"tags":[],"pricing":{"model":"free","free":true,"starting_price":null},"status":"active","verified":false},"capabilities":[{"id":"natural-questions__cap_0","uri":"capability://data.processing.analysis.open.domain.question.answering.evaluation.with.retrieval.comprehension","name":"open-domain question answering evaluation with retrieval + comprehension","description":"Evaluates QA systems on a two-stage pipeline: first retrieving relevant Wikipedia passages from 5.9M articles, then extracting answers from those passages. Unlike single-stage QA benchmarks, Natural Questions forces models to solve both information retrieval (finding the right document/passage) and reading comprehension (extracting the answer) in sequence, measuring end-to-end open-domain QA performance with 307,373 real Google Search queries paired with gold Wikipedia articles and human-annotated answers.","intents":["Benchmark my retrieval-augmented generation system against the standard open-domain QA evaluation","Measure whether my dense retriever can find relevant passages before my reader extracts answers","Compare my QA pipeline's performance on real user queries rather than synthetic questions","Evaluate how well my system handles unanswerable questions that require passage retrieval to determine answerability"],"best_for":["Teams building production RAG systems and open-domain QA pipelines","Researchers evaluating dense retrieval methods (DPR, ColBERT, etc.) and reader models","ML engineers optimizing two-stage QA architectures with separate retrieval and extraction components"],"limitations":["Requires implementing or integrating a retrieval component — benchmark does not provide pre-computed retrieval results, forcing teams to build/tune their own passage ranking","Wikipedia-only corpus may not generalize to domain-specific QA tasks or closed-book settings","Evaluation requires access to full Wikipedia dump (5.9M articles) for retrieval — significant computational overhead for baseline runs","Long answer annotations are paragraph-level, not sentence-level, making fine-grained answer boundary evaluation difficult","No temporal dimension — all questions and Wikipedia snapshots are from 2018, missing evolving information needs"],"requires":["Wikipedia dump (2018 version, ~20GB uncompressed) for retrieval corpus","Retrieval system capable of ranking 5.9M passages (dense retriever, BM25, or hybrid)","Reading comprehension model or span extraction capability","Ability to parse and process JSONL format with nested answer annotations","Computational resources for end-to-end pipeline evaluation (retrieval + reading inference)"],"input_types":["natural language questions (text)","Wikipedia articles (text with structured metadata)"],"output_types":["structured annotations: long answer (paragraph text + start/end offsets), short answer (entity text + token indices), answerability label (yes/no/unknown)"],"categories":["data-processing-analysis","benchmark-dataset"],"confidence":0.5,"matches":0,"success_rate":0},{"id":"natural-questions__cap_1","uri":"capability://data.processing.analysis.real.world.query.distribution.sampling.from.google.search.logs","name":"real-world query distribution sampling from google search logs","description":"Dataset contains 307,373 naturally-occurring questions extracted from anonymized Google Search query logs, preserving the distribution and phrasing of actual user information needs rather than synthetic or crowdsourced questions. Questions span diverse topics, question types (factual, definitional, numerical), and difficulty levels, with natural language variation (typos, fragments, colloquialisms) that synthetic datasets cannot capture. This grounds evaluation in real user behavior and search intent patterns.","intents":["Evaluate my QA system on questions users actually ask rather than crowdsourced or templated questions","Understand how my system performs on the natural language variation and ambiguity present in real search queries","Measure robustness to question phrasing diversity and edge cases that appear in production search logs","Validate that my QA improvements transfer to real user queries, not just benchmark artifacts"],"best_for":["Search engine teams and IR researchers validating QA components against production query distributions","Builders of conversational search systems who need realistic question diversity","Teams evaluating cross-lingual or multilingual QA (Natural Questions has non-English queries)"],"limitations":["Anonymization removes user context and session history — single-turn questions only, no multi-turn dialogue","Google Search log distribution may not reflect other search engines or domain-specific query patterns","Snapshot from 2018 — query language and user intent patterns have evolved (e.g., rise of voice search, mobile queries)","No explicit question type labels (factual vs. definitional vs. numerical) — requires manual categorization for analysis"],"requires":["Ability to parse and process JSONL format with question text and metadata","Understanding of Google Search query conventions and natural language variation"],"input_types":["natural language questions (text) from Google Search logs"],"output_types":["question text with metadata: document title, URL, question ID, annotator agreement metrics"],"categories":["data-processing-analysis","benchmark-dataset"],"confidence":0.5,"matches":0,"success_rate":0},{"id":"natural-questions__cap_2","uri":"capability://data.processing.analysis.dual.level.answer.annotation.with.long.and.short.answer.extraction","name":"dual-level answer annotation with long and short answer extraction","description":"Each question is annotated with two complementary answer types: long answers (paragraph-level passages from Wikipedia, marked with start/end character offsets) and short answers (entity-level spans, marked with token indices). Annotators identify both levels from the same Wikipedia article, or mark the question as unanswerable if no answer exists. This dual annotation enables evaluation of both passage-level retrieval quality (can the system find the right paragraph?) and fine-grained answer extraction (can it identify the exact entity or phrase?).","intents":["Evaluate my retrieval system's ability to rank relevant paragraphs above irrelevant ones","Measure my answer extraction model's precision at identifying exact entity spans within passages","Assess my system's performance on hierarchical answer structures (paragraph context + entity answer)","Determine if my QA pipeline correctly identifies unanswerable questions before attempting extraction"],"best_for":["Teams building two-stage QA systems with separate retrieval and extraction components","Researchers analyzing retrieval vs. extraction error modes independently","Builders of systems that need to return both context (paragraph) and answer (entity) to users"],"limitations":["Long answer annotations are paragraph-level only — no sentence-level boundaries, making fine-grained context evaluation difficult","Short answer annotations are limited to single entity spans — does not handle multi-span answers or complex answer structures","Answerability labels are binary (answerable/unanswerable) — no distinction between 'no answer in Wikipedia' vs. 'answer exists but not in provided article'","Inter-annotator agreement varies by question type — some questions have lower agreement on answer boundaries","Annotation is Wikipedia-specific — answers are constrained to text that exists in Wikipedia, not abstractive or paraphrased answers"],"requires":["Ability to parse and process character-level and token-level span annotations","Wikipedia article text with preserved formatting and offsets for span matching","Evaluation metrics that handle both passage-level and span-level metrics (e.g., F1 for short answers, EM for long answers)"],"input_types":["Wikipedia article text (full document)","Question text"],"output_types":["long answer: paragraph text with character-level start/end offsets","short answer: entity text with token-level start/end indices","answerability: boolean label (answerable/unanswerable)"],"categories":["data-processing-analysis","benchmark-dataset"],"confidence":0.5,"matches":0,"success_rate":0},{"id":"natural-questions__cap_3","uri":"capability://data.processing.analysis.answerability.classification.with.unanswerable.question.handling","name":"answerability classification with unanswerable question handling","description":"Annotators explicitly label each question as answerable or unanswerable based on whether a valid answer exists in the paired Wikipedia article. Unanswerable questions are not simply omitted — they are included in the benchmark with explicit labels, forcing QA systems to learn to recognize when no answer exists rather than always attempting extraction. This tests a critical capability for production systems: rejecting questions outside the knowledge base rather than hallucinating answers.","intents":["Evaluate my QA system's ability to correctly reject unanswerable questions instead of hallucinating answers","Measure precision-recall tradeoffs between answer extraction and answerability detection","Benchmark my system's confidence calibration — does it express uncertainty when no answer exists?","Test robustness to adversarial questions designed to trick QA systems into false answers"],"best_for":["Teams building production QA systems that must handle out-of-domain or unanswerable queries gracefully","Researchers studying hallucination and confidence calibration in QA models","Builders of systems that need to distinguish 'no answer found' from 'answer extraction failed'"],"limitations":["Answerability is binary — no distinction between 'answer not in Wikipedia' vs. 'answer exists but not in this specific article'","Unanswerable questions may have answers in other Wikipedia articles — benchmark only checks the paired article","No explicit adversarial or trick questions — unanswerable questions are naturally occurring, not designed to test specific failure modes","Answerability labels are article-specific — same question might be answerable with a different Wikipedia article"],"requires":["Ability to parse answerability labels from dataset annotations","Evaluation metrics that penalize both false positives (extracting from unanswerable questions) and false negatives (rejecting answerable questions)"],"input_types":["question text","Wikipedia article text"],"output_types":["answerability label: boolean (answerable/unanswerable)","confidence score for answerability prediction"],"categories":["data-processing-analysis","safety-moderation"],"confidence":0.5,"matches":0,"success_rate":0},{"id":"natural-questions__cap_4","uri":"capability://search.retrieval.wikipedia.corpus.indexing.and.passage.ranking.evaluation","name":"wikipedia corpus indexing and passage ranking evaluation","description":"Benchmark includes the full 5.9M Wikipedia article corpus (2018 snapshot) as the retrieval target, requiring systems to rank relevant passages above irrelevant ones. Evaluation measures retrieval performance independently of answer extraction — systems are scored on whether they retrieve the correct Wikipedia article and passage before attempting to extract the answer. This decouples retrieval quality from extraction quality, enabling diagnosis of pipeline failures.","intents":["Benchmark my dense retriever (DPR, ColBERT, etc.) on real open-domain retrieval tasks","Measure retrieval recall@k — what percentage of questions have the correct passage in the top-k results?","Compare different retrieval strategies (BM25, dense retrieval, hybrid) on the same benchmark","Identify retrieval bottlenecks in my QA pipeline before investing in extraction model improvements"],"best_for":["IR researchers and teams building dense retrieval systems","Engineers optimizing retrieval components in RAG pipelines","Teams evaluating passage ranking models (ColBERT, DPR, ANCE, etc.)"],"limitations":["Requires hosting or indexing 5.9M Wikipedia articles — significant computational and storage overhead (~20GB uncompressed, ~5GB indexed)","Benchmark does not provide pre-computed retrieval results — teams must implement and tune their own retrieval system","Retrieval evaluation is limited to Wikipedia articles only — does not test cross-domain or heterogeneous corpus retrieval","No explicit passage-level annotations — retrieval is evaluated at article level, not fine-grained passage level","Wikipedia corpus is static (2018 snapshot) — does not test retrieval on evolving or real-time information"],"requires":["Wikipedia dump (2018 version, ~20GB uncompressed)","Retrieval system capable of indexing and ranking 5.9M documents (dense retriever, BM25, or hybrid)","Ability to compute retrieval metrics (recall@k, MRR, NDCG) against gold Wikipedia articles","Computational resources for retrieval inference (GPU or CPU cluster for dense retrieval)"],"input_types":["question text","Wikipedia article corpus (5.9M articles with text and metadata)"],"output_types":["ranked list of Wikipedia articles with relevance scores","retrieval metrics: recall@k, mean reciprocal rank (MRR), normalized discounted cumulative gain (NDCG)"],"categories":["search-retrieval","data-processing-analysis"],"confidence":0.5,"matches":0,"success_rate":0},{"id":"natural-questions__cap_5","uri":"capability://data.processing.analysis.multi.annotator.agreement.and.answer.quality.assessment","name":"multi-annotator agreement and answer quality assessment","description":"Multiple annotators independently annotate each question with long and short answers, enabling measurement of inter-annotator agreement (IAA) and identification of ambiguous or difficult questions. Benchmark includes agreement metrics (e.g., F1 agreement between annotators) for each question, allowing researchers to filter by agreement level or analyze systematic disagreement patterns. This provides insight into question difficulty and annotation quality.","intents":["Understand which questions are inherently ambiguous or difficult based on annotator disagreement","Filter the benchmark to focus on high-agreement questions for cleaner evaluation","Analyze systematic disagreement patterns to identify annotation artifacts or question ambiguities","Calibrate my evaluation metrics — should I penalize answers that disagree with one annotator but agree with another?"],"best_for":["Researchers analyzing question difficulty and annotation quality","Teams building QA systems that need to understand benchmark reliability","Builders creating filtered subsets of the benchmark for specific evaluation scenarios"],"limitations":["Agreement metrics are computed post-hoc — do not reflect real-time annotation quality control","No explicit disagreement resolution — benchmark includes all annotator answers, not a single gold answer","Agreement is measured at span level — does not capture semantic equivalence (e.g., 'USA' vs. 'United States')","Number of annotators per question is not specified — unclear if all questions have the same number of annotations"],"requires":["Ability to parse multiple annotations per question from dataset","Agreement metrics implementation (F1, exact match, token overlap, etc.)"],"input_types":["multiple annotations per question (long answer, short answer, answerability from different annotators)"],"output_types":["inter-annotator agreement metrics (F1, exact match agreement)","question difficulty scores based on disagreement"],"categories":["data-processing-analysis","testing-quality"],"confidence":0.5,"matches":0,"success_rate":0},{"id":"natural-questions__cap_6","uri":"capability://data.processing.analysis.hierarchical.evaluation.metrics.for.retrieval.and.extraction.stages","name":"hierarchical evaluation metrics for retrieval and extraction stages","description":"Benchmark enables computation of separate evaluation metrics for retrieval and extraction stages: retrieval metrics (recall@k, MRR) measure whether the correct Wikipedia article is ranked highly, while extraction metrics (F1, exact match) measure whether the answer span is correctly identified. Pipeline metrics (end-to-end F1) measure overall QA performance. This modular evaluation approach allows diagnosis of failures at each stage and comparison of different architectural choices.","intents":["Measure my retrieval system's recall@k independently of extraction quality","Compute extraction F1 only on questions where retrieval succeeded, isolating extraction errors","Compare end-to-end QA performance against retrieval-only and extraction-only baselines","Identify whether my QA pipeline is bottlenecked by retrieval or extraction"],"best_for":["Teams optimizing two-stage QA pipelines and diagnosing failure modes","Researchers comparing different retrieval and extraction architectures","Engineers making architectural decisions about retrieval vs. extraction investment"],"limitations":["Metrics are computed independently — does not capture interaction effects between retrieval and extraction errors","Extraction metrics are computed on gold passages — does not measure extraction robustness to retrieval errors","No metrics for answer ranking or confidence calibration — only binary correctness","Metrics assume single-stage retrieval — does not support multi-hop or iterative retrieval evaluation"],"requires":["Ability to compute retrieval metrics (recall@k, MRR) against gold Wikipedia articles","Ability to compute extraction metrics (F1, exact match) against gold answer spans","Evaluation script that handles both stages and computes stage-specific metrics"],"input_types":["predicted answers (retrieved passages and extracted spans)","gold annotations (correct Wikipedia article, long answer, short answer)"],"output_types":["retrieval metrics: recall@k, mean reciprocal rank (MRR)","extraction metrics: F1, exact match (EM)","end-to-end metrics: pipeline F1, pipeline EM"],"categories":["data-processing-analysis","testing-quality"],"confidence":0.5,"matches":0,"success_rate":0},{"id":"natural-questions__cap_7","uri":"capability://data.processing.analysis.cross.domain.generalization.testing.via.wikipedia.article.diversity","name":"cross-domain generalization testing via wikipedia article diversity","description":"Natural Questions spans diverse Wikipedia article categories (science, history, biography, geography, etc.), enabling evaluation of QA system generalization across domains. Questions are paired with articles from different Wikipedia sections, testing whether systems can handle domain-specific terminology, article structures, and information patterns. This provides insight into cross-domain robustness beyond single-domain benchmarks.","intents":["Evaluate my QA system's generalization across diverse Wikipedia domains and article types","Measure performance on domain-specific questions (e.g., scientific, historical, biographical)","Identify domain-specific failure modes or biases in my retrieval or extraction models","Test robustness to different Wikipedia article structures and writing styles"],"best_for":["Teams building general-purpose QA systems that must handle diverse domains","Researchers studying domain adaptation and transfer learning in QA","Builders evaluating whether their QA system is overfitted to specific domains"],"limitations":["Domain labels are not explicitly provided — requires manual categorization or inference from article metadata","Wikipedia article distribution may not reflect real-world information needs across domains","No explicit domain-specific evaluation subsets — requires custom filtering and analysis","Domain-specific terminology and concepts may not be well-represented in general-purpose embeddings or language models"],"requires":["Ability to categorize Wikipedia articles by domain (using article categories or metadata)","Evaluation script that computes metrics per domain for analysis"],"input_types":["questions paired with Wikipedia articles from diverse domains"],"output_types":["per-domain evaluation metrics (F1, EM, recall@k by article category)"],"categories":["data-processing-analysis","testing-quality"],"confidence":0.5,"matches":0,"success_rate":0},{"id":"natural-questions__headline","uri":"capability://data.processing.analysis.open.domain.question.answering.benchmark.dataset","name":"open-domain question answering benchmark dataset","description":"Natural Questions is a comprehensive dataset designed for evaluating open-domain question answering systems, combining real user queries with Wikipedia content to test both information retrieval and reading comprehension.","intents":["best open-domain QA benchmark","open-domain QA dataset for model evaluation","Natural Questions dataset for RAG systems","how to evaluate question answering models","datasets for information retrieval testing"],"best_for":["researchers evaluating QA systems","developers building RAG frameworks"],"limitations":[],"requires":[],"input_types":[],"output_types":[],"categories":["data-processing-analysis"],"confidence":0.5,"matches":0,"success_rate":0}],"trust":{"score":57,"verified":false,"data_access_risk":"high","permissions":["Wikipedia dump (2018 version, ~20GB uncompressed) for retrieval corpus","Retrieval system capable of ranking 5.9M passages (dense retriever, BM25, or hybrid)","Reading comprehension model or span extraction capability","Ability to parse and process JSONL format with nested answer annotations","Computational resources for end-to-end pipeline evaluation (retrieval + reading inference)","Ability to parse and process JSONL format with question text and metadata","Understanding of Google Search query conventions and natural language variation","Ability to parse and process character-level and token-level span annotations","Wikipedia article text with preserved formatting and offsets for span matching","Evaluation metrics that handle both passage-level and span-level metrics (e.g., F1 for short answers, EM for long answers)"],"failure_modes":["Requires implementing or integrating a retrieval component — benchmark does not provide pre-computed retrieval results, forcing teams to build/tune their own passage ranking","Wikipedia-only corpus may not generalize to domain-specific QA tasks or closed-book settings","Evaluation requires access to full Wikipedia dump (5.9M articles) for retrieval — significant computational overhead for baseline runs","Long answer annotations are paragraph-level, not sentence-level, making fine-grained answer boundary evaluation difficult","No temporal dimension — all questions and Wikipedia snapshots are from 2018, missing evolving information needs","Anonymization removes user context and session history — single-turn questions only, no multi-turn dialogue","Google Search log distribution may not reflect other search engines or domain-specific query patterns","Snapshot from 2018 — query language and user intent patterns have evolved (e.g., rise of voice search, mobile queries)","No explicit question type labels (factual vs. definitional vs. numerical) — requires manual categorization for analysis","Long answer annotations are paragraph-level only — no sentence-level boundaries, making fine-grained context evaluation difficult","builder identity is not verified yet","no observed match outcomes yet"],"rank_breakdown":{"adoption":0.7,"quality":0.8500000000000001,"ecosystem":0.39999999999999997,"match_graph":0.25,"freshness":0.75,"weights":{"adoption":0.3,"quality":0.25,"ecosystem":0.1,"match_graph":0.3,"freshness":0.05}},"observed_outcomes":{"matches":0,"success_rate":0,"avg_confidence":0,"top_intents":[],"last_matched_at":null},"maintenance":{"status":"active","updated_at":"2026-05-24T12:16:23.328Z","last_scraped_at":null,"last_commit":null},"community":{"stars":null,"forks":null,"weekly_downloads":null,"model_downloads":null,"model_likes":null}},"distribution":{"claim_url":"https://unfragile.ai/submit?claim=natural-questions","compare_url":"https://unfragile.ai/compare?artifact=natural-questions"}},"signature":"wH+uM9KUC2MaUngvWVaeHcowwP7T63edhhTsUxmXhcoKR9BIXM1tez88jVCyUATwZ3JcQpnNkrK8/BSGcfoOCQ==","signedAt":"2026-06-21T10:17:02.967Z","signedBy":"unfragile.ai","version":1},"_links":{"self":"https://unfragile.ai/api/v1/passport/natural-questions","artifact":"https://unfragile.ai/natural-questions","verify":"https://unfragile.ai/api/v1/verify?slug=natural-questions","publicKey":"https://unfragile.ai/api/v1/trust-passport-public-key","spec":"https://unfragile.ai/trust","schema":"https://unfragile.ai/schema.json","docs":"https://unfragile.ai/docs"}}