HotpotQA
Dataset · Free. 113K questions requiring multi-hop reasoning across Wikipedia articles.
Capabilities (6 decomposed)
multi-hop reasoning dataset construction with supporting fact annotation
Medium confidence: Provides 113,000 question-answer pairs where each question requires chaining reasoning across 2+ Wikipedia articles to derive the answer. The dataset includes explicit supporting fact annotations identifying which sentences from source documents are necessary for answering, enabling training of models that can both answer questions and justify their reasoning through evidence selection. Built on Wikipedia snapshots with crowdsourced annotation of answer spans and supporting sentences.
Combines answer prediction with supporting fact annotation in a single dataset, enabling joint training of answer generation and evidence selection. Unlike SQuAD (single-document) or MS MARCO (ranking-focused), HotpotQA explicitly requires models to perform intermediate reasoning steps and identify which sentences enable the final answer, making it the first large-scale dataset to measure both answer correctness AND reasoning transparency.
Uniquely measures explainability through supporting fact prediction rather than just answer accuracy, forcing models to learn which evidence matters rather than memorizing answer patterns from single documents.
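As a minimal sketch of what one record looks like, the snippet below loads the dataset through the Hugging Face `datasets` hub copy. The dataset id (`hotpot_qa`), the `distractor` configuration, and the field names (`question`, `answer`, `supporting_facts`) follow that hub schema and should be treated as assumptions about the exact packaging, not part of the original dataset release.

```python
# Minimal sketch: load one HotpotQA example and print its supporting facts.
# Assumes the Hugging Face `datasets` library and the hub copy "hotpot_qa"
# with its "distractor" configuration; field names follow that schema.
from datasets import load_dataset

dev = load_dataset("hotpot_qa", "distractor", split="validation")
ex = dev[0]

print(ex["question"])
print(ex["answer"])

# Each supporting fact pairs a paragraph title with a sentence index inside it.
for title, sent_id in zip(ex["supporting_facts"]["title"],
                          ex["supporting_facts"]["sent_id"]):
    print(f"evidence: {title}  [sentence {sent_id}]")
```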
compositional reasoning evaluation through multi-document retrieval and reasoning chains
Medium confidence: Enables evaluation of whether QA systems can decompose complex questions into sub-questions, retrieve relevant documents for each step, and chain reasoning across multiple sources. The dataset structure (questions requiring 2+ hops) forces models to learn retrieval-then-reasoning patterns rather than end-to-end memorization. Supports both open-domain (retrieve from full Wikipedia) and distractor-based (retrieve from provided candidates) evaluation modes.
Explicitly structures questions to require intermediate reasoning steps (e.g., 'Who directed film X?' → find film → find director → extract name), forcing evaluation of whether systems learn compositional reasoning vs pattern matching. Supporting fact annotations enable measuring retrieval quality independently from answer correctness, unlike SQuAD where retrieval is implicit.
Uniquely decouples retrieval evaluation from answer evaluation through supporting fact metrics, revealing whether models retrieve correct evidence even when they produce wrong answers — a diagnostic capability absent from single-document QA benchmarks.
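As a sketch of that decoupling, the snippet below scores a retriever's evidence recall against the gold supporting-fact paragraph titles, regardless of what answer the downstream reader produces. The titles in the toy example are for illustration only; in practice the gold titles come from each example's supporting fact annotation.

```python
# Sketch: evaluate retrieval quality against supporting-fact paragraph titles,
# independently of answer correctness. The toy values below are illustrative only.
from typing import List, Set

def evidence_recall(gold_titles: Set[str], retrieved_titles: List[str], k: int = 10) -> float:
    """Fraction of gold supporting paragraphs present in the top-k retrieved titles."""
    if not gold_titles:
        return 0.0
    hits = gold_titles & set(retrieved_titles[:k])
    return len(hits) / len(gold_titles)

gold = {"Kiss and Tell (1945 film)", "Shirley Temple"}
retrieved = ["Shirley Temple", "Kiss and Tell (1945 film)", "Kiss and Tell (play)"]
print(evidence_recall(gold, retrieved))  # 1.0, even if the reader's answer were wrong
```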
supporting fact prediction for explainability evaluation
Medium confidence: Provides ground-truth supporting fact annotations (sentence-level indices from source documents) enabling training and evaluation of models that predict which evidence is necessary for answering. This enables measuring explainability as a quantitative metric (supporting fact F1/precision/recall) rather than qualitative assessment. Models can be trained jointly on answer prediction and supporting fact prediction, or separately for interpretability analysis.
First large-scale QA dataset to include sentence-level supporting fact annotations, enabling quantitative measurement of explainability through supporting fact F1 rather than subjective evaluation. This shifts explainability from a qualitative property to a measurable metric that can be optimized during training.
Enables explainability as a first-class optimization target (supporting fact F1) rather than an afterthought, unlike SQuAD or MS MARCO where evidence selection is implicit and unmeasured.
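A sketch of that metric, in the spirit of the official evaluation: predicted and gold supporting facts are compared as sets of (paragraph title, sentence index) pairs. Exact tie-breaking and aggregation details of the official script may differ.

```python
# Sketch of supporting-fact precision/recall/F1 over (title, sentence_index) pairs.
# Mirrors the spirit of the official HotpotQA evaluation; treat details as assumptions.
from typing import Iterable, Tuple

Fact = Tuple[str, int]

def supporting_fact_scores(pred: Iterable[Fact], gold: Iterable[Fact]):
    pred, gold = set(pred), set(gold)
    tp = len(pred & gold)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

gold = {("Shirley Temple", 0), ("Kiss and Tell (1945 film)", 1)}
pred = {("Shirley Temple", 0), ("Shirley Temple", 3)}
print(supporting_fact_scores(pred, gold))  # (0.5, 0.5, 0.5)
```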
distractor-based evaluation mode for controlled reasoning assessment
Medium confidence: Provides a curated set of distractor documents (Wikipedia articles that are topically related but don't contain supporting facts) alongside correct source documents, enabling controlled evaluation of reading comprehension and reasoning without requiring full retrieval. Models receive a fixed set of candidate documents and must identify which contain relevant information and extract answers, isolating reasoning capability from retrieval quality.
Provides curated distractor documents (topically related but non-supporting) rather than random negatives, enabling more realistic evaluation of document relevance judgment. Distractors are selected to be challenging (e.g., same topic, different entity) rather than trivial, forcing models to perform fine-grained reasoning.
Offers a middle ground between single-document SQuAD (no retrieval challenge) and open-domain evaluation (expensive retrieval), enabling controlled reasoning assessment with realistic document selection difficulty.
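A sketch of how a distractor-setting example can be split into gold and distractor paragraphs for analysis before feeding a reader model. Field names again follow the Hugging Face copy of the dataset and are assumptions about the packaging.

```python
# Sketch: separate gold (supporting) paragraphs from distractors in one example.
# In the distractor setting each example bundles roughly 10 paragraphs: the 2 gold
# ones plus topically related distractors. Field names follow the assumed HF schema.
def split_gold_and_distractors(example: dict):
    gold_titles = set(example["supporting_facts"]["title"])
    gold, distractors = [], []
    for title, sentences in zip(example["context"]["title"],
                                example["context"]["sentences"]):
        (gold if title in gold_titles else distractors).append((title, sentences))
    return gold, distractors

# A reader receives all paragraphs and must both answer and flag the supporting
# sentences inside the gold paragraphs; this split is only for offline analysis.
```

Applied to the record loaded in the earlier sketch, `split_gold_and_distractors(ex)` would return the two gold paragraphs and the remaining distractors.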
benchmark dataset for evaluating reasoning transparency and answer justification
Medium confidence: Serves as a standardized benchmark for measuring both answer correctness and reasoning transparency through supporting fact prediction. The dataset includes train/dev/test splits with consistent evaluation protocols, enabling reproducible comparison of QA systems on their ability to produce correct answers AND identify supporting evidence. Supports multiple evaluation metrics (answer F1, supporting fact F1, combined scores) for comprehensive system assessment.
Combines answer evaluation with supporting fact evaluation in a single benchmark, forcing systems to be evaluated on both correctness AND transparency. Unlike SQuAD (answer-only) or information retrieval benchmarks (ranking-only), HotpotQA measures the full pipeline of reasoning, retrieval, and justification.
Uniquely standardizes evaluation of reasoning transparency alongside answer accuracy, enabling reproducible comparison of systems on their ability to justify answers — a capability absent from single-metric benchmarks.
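A sketch of the combined ("joint") score described in the HotpotQA paper: answer-level and supporting-fact-level precision and recall are multiplied before computing F1, so a system is rewarded only when it gets both the answer and the evidence right.

```python
# Sketch of the joint metric: joint precision and recall are the products of the
# answer-level and supporting-fact-level values, and joint F1 is computed from those.
def joint_f1(ans_p: float, ans_r: float, sp_p: float, sp_r: float) -> float:
    p = ans_p * sp_p
    r = ans_r * sp_r
    return 2 * p * r / (p + r) if (p + r) else 0.0

# A perfect answer with only half the evidence still caps the joint score at 0.5.
print(joint_f1(1.0, 1.0, 0.5, 0.5))  # 0.5
```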
wikipedia-grounded question generation for domain-specific reasoning
Medium confidence: Questions are generated from Wikipedia articles and require reasoning over real-world entities, relationships, and facts. This grounds reasoning in a concrete knowledge domain (Wikipedia) rather than synthetic or template-based questions, enabling evaluation of whether systems can handle real-world complexity. Questions span diverse topics (people, places, films, organizations) and reasoning patterns (attribute lookup, entity linking, relationship chaining).
Questions are grounded in real Wikipedia entities and relationships rather than synthetic templates, requiring models to handle actual knowledge base complexity (entity disambiguation, relationship chaining, fact lookup). This makes reasoning evaluation more realistic than template-based datasets.
Grounds reasoning in a real, large-scale knowledge base (Wikipedia) rather than synthetic examples, enabling evaluation of whether systems can handle real-world entity linking and relationship reasoning.
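To make the "relationship chaining" pattern concrete, here is a toy sketch of a two-hop lookup over a tiny hand-written fact store. The entities and relations are invented for illustration and are not part of the dataset; in HotpotQA the equivalent facts must be found in the text of two Wikipedia articles.

```python
# Toy sketch of a two-hop reasoning chain (film -> director -> nationality).
# The fact store below is invented for illustration only.
facts = {
    ("Film X", "directed_by"): "Director Y",
    ("Director Y", "nationality"): "Country Z",
}

def two_hop(entity: str, first_rel: str, second_rel: str) -> str:
    intermediate = facts[(entity, first_rel)]   # hop 1: find the bridge entity
    return facts[(intermediate, second_rel)]    # hop 2: read the attribute off it

print(two_hop("Film X", "directed_by", "nationality"))  # Country Z
```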
Capabilities are decomposed by AI analysis. Each maps to specific user intents and improves with match feedback.
Related Artifacts (sharing capabilities)
Artifacts that share capabilities with HotpotQA, ranked by overlap. Discovered automatically through the match graph.
Build a Reasoning Model (From Scratch)
A guide to building a working reasoning model from the ground up, by Sebastian Raschka.
Agentset
An open-source platform for building and evaluating RAG and agentic applications. [#opensource](https://github.com/agentset-ai/agentset)
TriviaQA
95K trivia questions requiring cross-document reasoning.
Capybara
Multi-turn conversation dataset for steerable models.
Qwen3-4B
Text-generation model. 7,205,785 downloads.
Mistral: Ministral 3 14B 2512
The largest model in the Ministral 3 family, Ministral 3 14B offers frontier capabilities and performance comparable to its larger Mistral Small 3.2 24B counterpart. A powerful and efficient language...
Best For
- ✓ Researchers developing multi-hop QA and reasoning models
- ✓ Teams building explainable QA systems where supporting evidence is a first-class requirement
- ✓ ML engineers evaluating whether models can perform compositional reasoning vs memorization
- ✓ Organizations implementing RAG systems that need to validate retrieval quality through supporting fact metrics
- ✓ Researchers developing retrieval-augmented generation (RAG) systems
- ✓ Teams building open-domain QA systems that need to validate multi-hop retrieval
- ✓ ML engineers optimizing retrieval-then-read pipelines for complex questions
- ✓ Benchmark creators evaluating reasoning capabilities of large language models
Known Limitations
- ⚠ Wikipedia-only source domain — may not generalize to other document types (scientific papers, legal documents, news)
- ⚠ Supporting facts are binary (relevant/irrelevant) rather than ranked by importance — doesn't capture partial relevance
- ⚠ Questions are English-only; no multilingual variants for cross-lingual reasoning evaluation
- ⚠ Static snapshot of Wikipedia from October 2017 — entity/fact changes after that date are not reflected
- ⚠ Crowdsourced annotations have inherent noise; inter-annotator agreement not published for all subsets
- ⚠ Open-domain evaluation requires a full Wikipedia index (~20GB+) — computationally expensive for iteration
UnfragileRank
UnfragileRank is computed from adoption signals, documentation quality, ecosystem connectivity, match graph feedback, and freshness. No artifact can pay for a higher rank.
About
Multi-hop question answering dataset containing 113,000 questions that require reasoning over two or more Wikipedia articles to answer. Each question includes supporting facts identifying which sentences are necessary for the answer. Tests compositional reasoning: e.g., 'What nationality is the director of film X?' requires finding the film, identifying the director, and looking up their nationality. Supports both answer extraction and explainability evaluation through supporting fact prediction.
Alternatives to HotpotQA
The GitHub for AI — 500K+ models, datasets, Spaces, Inference API, hub for open-source AI.