{"passport":{"unfragile":{"@version":"1.0","version":"2026-05","artifact":{"id":"hotpotqa","slug":"hotpotqa","name":"HotpotQA","type":"dataset","url":"https://huggingface.co/datasets/hotpotqa/hotpot_qa","page_url":"https://unfragile.ai/hotpotqa","categories":["model-training"],"tags":[],"pricing":{"model":"free","free":true,"starting_price":null},"status":"active","verified":false},"capabilities":[{"id":"hotpotqa__cap_0","uri":"capability://data.processing.analysis.multi.hop.reasoning.dataset.construction.with.supporting.fact.annotation","name":"multi-hop reasoning dataset construction with supporting fact annotation","description":"Provides 113,000 question-answer pairs where each question requires traversing and reasoning across 2+ Wikipedia articles to derive the answer. The dataset includes explicit supporting fact annotations identifying which sentences from source documents are necessary for answering, enabling training of models that can both answer questions and explain their reasoning chains. Built through crowdsourced annotation with quality control mechanisms to ensure multi-hop reasoning is genuinely required rather than answerable from single documents.","intents":["Train question-answering models that can perform multi-step reasoning over document collections","Evaluate whether QA systems can identify and cite the specific evidence supporting their answers","Benchmark compositional reasoning capabilities where answers require chaining facts across multiple sources","Develop explainability mechanisms that show which source sentences contributed to each answer"],"best_for":["Researchers developing multi-hop QA systems and evaluating reasoning transparency","Teams building RAG systems that need to justify answer provenance across multiple documents","ML engineers training models for complex information retrieval requiring document composition"],"limitations":["Limited to Wikipedia as source domain — may not generalize to other document types or specialized corpora","Supporting fact annotations are human-provided and subject to annotator disagreement on sentence-level boundaries","Questions are English-only; no multilingual variants for cross-lingual reasoning evaluation","Static snapshot of Wikipedia content; links and article structure may have changed since annotation"],"requires":["HuggingFace Datasets library (datasets>=2.0.0) for loading","Python 3.7+ for data processing","Sufficient disk space (~2GB for full dataset with Wikipedia articles)"],"input_types":["Question text (string)","Wikipedia article passages (text)","Answer text (string)"],"output_types":["Structured JSON with question, answer, supporting facts, and article references","Evaluation metrics (F1 score for supporting fact prediction, EM/F1 for answer extraction)"],"categories":["data-processing-analysis","planning-reasoning"],"confidence":0.5,"matches":0,"success_rate":0},{"id":"hotpotqa__cap_1","uri":"capability://data.processing.analysis.supporting.fact.prediction.evaluation.framework","name":"supporting fact prediction evaluation framework","description":"Provides a structured evaluation methodology for assessing whether QA systems can correctly identify which source sentences support their answers. The framework compares predicted supporting facts against human-annotated ground truth using precision, recall, and F1 metrics at both sentence and paragraph levels. This enables measurement of reasoning transparency independent of answer correctness, allowing diagnosis of whether a system found the right answer for the right reasons.","intents":["Measure whether QA models identify correct supporting evidence, not just lucky guesses","Evaluate explainability quality by comparing predicted vs. human-identified reasoning chains","Debug QA system failures by determining if wrong answers stem from poor retrieval or poor reasoning","Compare different retrieval and reasoning architectures on their ability to cite evidence"],"best_for":["Researchers evaluating interpretability and explainability of QA systems","Teams building production QA systems where answer justification is required for user trust","Developers comparing different retrieval-augmented generation architectures"],"limitations":["Evaluation assumes sentence-level granularity; may not capture partial relevance or nuanced supporting relationships","Human annotations may contain errors or disagreement on what constitutes sufficient supporting evidence","Metrics are reference-based; cannot evaluate supporting facts for questions with multiple valid reasoning paths","No automatic metric for evaluating whether cited facts are actually sufficient to derive the answer"],"requires":["Predicted supporting facts in same format as ground truth annotations","Python 3.7+ with standard evaluation libraries","Access to original Wikipedia articles for sentence-level matching"],"input_types":["Predicted supporting facts (list of document-sentence pairs)","Ground truth supporting facts (list of document-sentence pairs)","Question and answer text for context"],"output_types":["Precision, recall, F1 scores for supporting fact prediction","Per-question evaluation results with detailed breakdowns","Aggregate statistics across dataset splits"],"categories":["data-processing-analysis","safety-moderation"],"confidence":0.5,"matches":0,"success_rate":0},{"id":"hotpotqa__cap_2","uri":"capability://planning.reasoning.compositional.reasoning.benchmark.with.multi.document.retrieval.requirements","name":"compositional reasoning benchmark with multi-document retrieval requirements","description":"Structures questions to require explicit composition of facts across multiple Wikipedia articles, creating a benchmark where naive single-document retrieval fails. Questions are designed such that the answer cannot be found in any single article; instead, the system must retrieve multiple relevant documents, identify the connecting entity or relationship, and synthesize information across them. This tests whether systems can perform true multi-hop reasoning versus pattern matching on single documents.","intents":["Benchmark whether retrieval systems can identify all necessary documents for multi-hop questions","Test whether reasoning systems can compose information across document boundaries","Evaluate whether QA systems degrade gracefully when required documents are not in top-k retrieval results","Compare single-stage vs. multi-stage retrieval architectures on compositional reasoning tasks"],"best_for":["Researchers developing multi-stage retrieval and reasoning pipelines","Teams evaluating RAG systems on complex information synthesis tasks","Developers benchmarking whether their systems perform genuine reasoning vs. surface-level matching"],"limitations":["All questions are answerable from Wikipedia; does not test reasoning over conflicting or uncertain information","Question types are limited to specific patterns (e.g., 'What is X's Y?'); does not cover all reasoning types","Requires access to full Wikipedia corpus for proper evaluation; subset evaluation may not reflect true multi-hop difficulty","Supporting fact annotations may not capture all valid reasoning paths, potentially penalizing alternative correct approaches"],"requires":["Full Wikipedia article corpus or access to Wikipedia API","Multi-document retrieval system capable of ranking and combining results from multiple articles","Reasoning module that can identify connections between retrieved documents"],"input_types":["Natural language questions (string)","Wikipedia article corpus (text documents with metadata)"],"output_types":["Answer text (string)","Supporting facts with document references","Intermediate reasoning steps (optional, for interpretable systems)"],"categories":["planning-reasoning","search-retrieval"],"confidence":0.5,"matches":0,"success_rate":0},{"id":"hotpotqa__cap_3","uri":"capability://search.retrieval.distractor.document.filtering.and.ranking.evaluation","name":"distractor document filtering and ranking evaluation","description":"Provides a controlled evaluation setting where systems must distinguish relevant documents from distractors. The dataset includes both supporting documents (necessary for answering) and distractor documents (related to the question but not required for the answer). This tests whether retrieval systems can rank supporting documents above distractors, a critical capability for multi-hop QA where false positives in retrieval compound through reasoning stages. Evaluation measures whether systems retrieve all necessary documents while minimizing false positives.","intents":["Evaluate retrieval system precision and recall on multi-hop questions with controlled distractor sets","Test whether dense retrievers can distinguish supporting documents from topically-related distractors","Benchmark retrieval-augmented generation systems on their ability to filter noise before reasoning","Compare retrieval strategies (BM25, dense embeddings, hybrid) on multi-hop document ranking"],"best_for":["Teams optimizing retrieval components in RAG pipelines for multi-hop reasoning","Researchers evaluating dense retriever quality on compositional reasoning tasks","Developers tuning retrieval-reasoning trade-offs in multi-stage QA systems"],"limitations":["Distractor selection is heuristic-based (related Wikipedia articles); may not reflect real-world noise distributions","Assumes binary relevance (supporting vs. non-supporting); does not capture partial relevance or multi-level importance","Distractor documents are from Wikipedia; may not generalize to other document sources with different characteristics","Does not evaluate ranking quality when supporting documents are not in the candidate set"],"requires":["Document retrieval system capable of ranking candidate documents","Access to full Wikipedia corpus or pre-computed embeddings for retrieval","Evaluation harness to compute precision/recall on document-level predictions"],"input_types":["Question text (string)","Candidate document set including supporting and distractor articles (text)"],"output_types":["Document ranking scores or binary relevance predictions","Precision, recall, and MRR metrics for document retrieval","Analysis of retrieval errors (false positives, false negatives)"],"categories":["search-retrieval","data-processing-analysis"],"confidence":0.5,"matches":0,"success_rate":0},{"id":"hotpotqa__cap_4","uri":"capability://data.processing.analysis.question.type.classification.and.reasoning.pattern.analysis","name":"question type classification and reasoning pattern analysis","description":"Categorizes questions into distinct reasoning types (e.g., 'bridge' questions requiring entity linking between documents, 'comparison' questions requiring fact synthesis) and provides labels enabling analysis of system performance across reasoning patterns. This allows fine-grained evaluation of which reasoning types systems handle well vs. poorly, and enables targeted training or evaluation on specific compositional reasoning challenges. The taxonomy captures the structural reasoning requirements independent of domain content.","intents":["Analyze which types of multi-hop reasoning patterns are most challenging for QA systems","Train specialized models for specific reasoning types or use question type to select appropriate reasoning strategies","Evaluate whether systems have balanced performance across reasoning types or systematic weaknesses","Debug system failures by correlating errors with question type and identifying reasoning bottlenecks"],"best_for":["Researchers analyzing reasoning capabilities and failure modes in QA systems","Teams developing adaptive QA systems that select strategies based on question type","Developers creating targeted training datasets for specific reasoning patterns"],"limitations":["Question type taxonomy is limited to a few categories; does not capture all reasoning patterns or hybrid types","Type labels are human-assigned and may contain errors or ambiguity for questions spanning multiple types","Type distribution may be imbalanced across dataset splits, affecting statistical significance of type-specific analysis","Type-based analysis assumes reasoning type is the primary factor in difficulty; other factors (entity ambiguity, document relevance) may dominate"],"requires":["Question type labels from dataset metadata","System predictions on full dataset to enable type-stratified analysis","Statistical analysis tools to compute performance metrics per type"],"input_types":["Question text with type label (string + categorical)","System predictions (answers and/or supporting facts)"],"output_types":["Per-type performance metrics (accuracy, F1, supporting fact F1)","Type-stratified error analysis and confusion matrices","Reasoning pattern difficulty rankings"],"categories":["data-processing-analysis","planning-reasoning"],"confidence":0.5,"matches":0,"success_rate":0},{"id":"hotpotqa__cap_5","uri":"capability://memory.knowledge.wikipedia.grounded.question.generation.for.domain.specific.reasoning","name":"wikipedia-grounded question generation for domain-specific reasoning","description":"Questions are generated from Wikipedia articles and require reasoning over real-world entities, relationships, and facts. This grounds reasoning in a concrete knowledge domain (Wikipedia) rather than synthetic or template-based questions, enabling evaluation of whether systems can handle real-world complexity. Questions span diverse topics (people, places, films, organizations) and reasoning patterns (attribute lookup, entity linking, relationship chaining).","intents":["Evaluate QA systems on real-world Wikipedia-based reasoning rather than synthetic templates","Test whether models can handle diverse entity types and relationship patterns from Wikipedia","Develop systems that can reason over actual knowledge bases (Wikipedia) rather than abstract examples","Benchmark generalization across different Wikipedia domains (people, films, organizations, etc.)"],"best_for":["Researchers studying reasoning over real-world knowledge bases","Teams building QA systems that must handle diverse entity types and relationships","ML engineers evaluating generalization across Wikipedia domains","Organizations implementing knowledge-base QA systems"],"limitations":["Wikipedia-specific — reasoning patterns may not transfer to other knowledge bases (scientific papers, legal documents)","Entity linking is implicit — models must learn to identify entities without explicit entity annotations","Wikipedia facts are static (2018 snapshot) — doesn't test reasoning over evolving knowledge","Reasoning patterns are limited to Wikipedia structure — may not cover all real-world reasoning types","No explicit domain labels — difficult to analyze performance across different entity types"],"requires":["Wikipedia knowledge base (2018 snapshot provided with dataset)","Entity linking capability to map question mentions to Wikipedia articles","Knowledge of Wikipedia structure and article linking patterns"],"input_types":["Natural language questions grounded in Wikipedia entities","Wikipedia article corpus with hyperlinks and structure"],"output_types":["Answers extracted from Wikipedia text","Supporting facts (sentences from Wikipedia articles)","Implicit entity and relationship chains"],"categories":["memory-knowledge","planning-reasoning"],"confidence":0.5,"matches":0,"success_rate":0},{"id":"hotpotqa__headline","uri":"capability://model.training.multi.hop.question.answering.dataset","name":"multi-hop question answering dataset","description":"HotpotQA is a multi-hop question answering dataset designed for evaluating models that require reasoning over multiple Wikipedia articles, providing 113,000 questions with supporting facts for answer extraction and explainability.","intents":["best multi-hop question answering dataset","multi-hop QA dataset for model training","datasets for reasoning over multiple documents","question answering datasets with supporting facts","datasets for explainability in AI models"],"best_for":["research in question answering","training AI models for reasoning tasks"],"limitations":[],"requires":[],"input_types":[],"output_types":[],"categories":["model-training"],"confidence":0.5,"matches":0,"success_rate":0}],"trust":{"score":56,"verified":false,"data_access_risk":"low","permissions":["HuggingFace Datasets library (datasets>=2.0.0) for loading","Python 3.7+ for data processing","Sufficient disk space (~2GB for full dataset with Wikipedia articles)","Predicted supporting facts in same format as ground truth annotations","Python 3.7+ with standard evaluation libraries","Access to original Wikipedia articles for sentence-level matching","Full Wikipedia article corpus or access to Wikipedia API","Multi-document retrieval system capable of ranking and combining results from multiple articles","Reasoning module that can identify connections between retrieved documents","Document retrieval system capable of ranking candidate documents"],"failure_modes":["Limited to Wikipedia as source domain — may not generalize to other document types or specialized corpora","Supporting fact annotations are human-provided and subject to annotator disagreement on sentence-level boundaries","Questions are English-only; no multilingual variants for cross-lingual reasoning evaluation","Static snapshot of Wikipedia content; links and article structure may have changed since annotation","Evaluation assumes sentence-level granularity; may not capture partial relevance or nuanced supporting relationships","Human annotations may contain errors or disagreement on what constitutes sufficient supporting evidence","Metrics are reference-based; cannot evaluate supporting facts for questions with multiple valid reasoning paths","No automatic metric for evaluating whether cited facts are actually sufficient to derive the answer","All questions are answerable from Wikipedia; does not test reasoning over conflicting or uncertain information","Question types are limited to specific patterns (e.g., 'What is X's Y?'); does not cover all reasoning types","builder identity is not verified yet","no observed match outcomes yet"],"rank_breakdown":{"adoption":0.7,"quality":0.8500000000000001,"ecosystem":0.3,"match_graph":0.25,"freshness":0.75,"weights":{"adoption":0.3,"quality":0.25,"ecosystem":0.1,"match_graph":0.3,"freshness":0.05}},"observed_outcomes":{"matches":0,"success_rate":0,"avg_confidence":0,"top_intents":[],"last_matched_at":null},"maintenance":{"status":"active","updated_at":"2026-05-24T12:16:23.327Z","last_scraped_at":null,"last_commit":null},"community":{"stars":null,"forks":null,"weekly_downloads":null,"model_downloads":null,"model_likes":null}},"distribution":{"claim_url":"https://unfragile.ai/submit?claim=hotpotqa","compare_url":"https://unfragile.ai/compare?artifact=hotpotqa"}},"signature":"K28aZGaGdrQCSBy3LXmudQN3zfRkJ4shfTYaIqzq2+vSFQYyBYeMU1hSrL7PwsPM5Ty/O7mPEgk7nuYIK7DzBQ==","signedAt":"2026-06-23T14:42:07.092Z","signedBy":"unfragile.ai","version":1},"_links":{"self":"https://unfragile.ai/api/v1/passport/hotpotqa","artifact":"https://unfragile.ai/hotpotqa","verify":"https://unfragile.ai/api/v1/verify?slug=hotpotqa","publicKey":"https://unfragile.ai/api/v1/trust-passport-public-key","spec":"https://unfragile.ai/trust","schema":"https://unfragile.ai/schema.json","docs":"https://unfragile.ai/docs"}}