SQuAD 2.0
Dataset · Free · 150K reading comprehension questions, including unanswerable ones.
Capabilities · 6 decomposed
extractive question-answering benchmark with adversarial unanswerable questions
Medium confidence · SQuAD 2.0 provides 150,000 questions paired with Wikipedia article passages, where models must either extract the correct answer span from the passage or recognize that no valid answer exists. The dataset includes 50,000 adversarially-crafted unanswerable questions that are syntactically similar to answerable ones, forcing models to develop genuine reading comprehension rather than surface-level pattern matching. The data is distributed as a JSON-structured dataset of passage-question-answer triplets, in which unanswerable questions are annotated with plausible but incorrect distractor spans from the passage.
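A minimal sketch of working with this structure, assuming the official train-v2.0.json file has been downloaded locally (the file name is an assumption; the field names follow the published v2.0 JSON schema):

```python
import json

# Walk the SQuAD 2.0 JSON and split examples into answerable and unanswerable
# sets using the is_impossible flag carried by every question entry.
with open("train-v2.0.json") as f:   # path is an assumption; download from the SQuAD site
    squad = json.load(f)

answerable, unanswerable = [], []
for article in squad["data"]:
    for paragraph in article["paragraphs"]:
        context = paragraph["context"]
        for qa in paragraph["qas"]:
            record = {
                "id": qa["id"],
                "question": qa["question"],
                "context": context,
                # Unanswerable questions have an empty answers list; many also
                # carry "plausible_answers" pointing at distractor spans.
                "answers": [a["text"] for a in qa["answers"]],
            }
            (unanswerable if qa.get("is_impossible", False) else answerable).append(record)

print(len(answerable), "answerable /", len(unanswerable), "unanswerable")
```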
First large-scale QA dataset to systematically include adversarial unanswerable questions (33% of dataset) that require models to recognize when context is insufficient, rather than forcing extraction of incorrect spans. Uses crowdworker-generated questions on real Wikipedia passages with explicit annotation of answer spans and answerability labels, creating a more realistic evaluation scenario than synthetic datasets.
SQuAD 2.0 is more challenging than SQuAD 1.1 and MS MARCO because it requires models to explicitly model answerability rather than always extracting, and it uses human-written questions on real passages rather than template-based or synthetic question generation, making it a more reliable benchmark for production QA systems.
standardized evaluation metrics for extractive qa with leaderboard ranking
Medium confidence · SQuAD 2.0 provides standardized Exact Match (EM) and F1 scoring functions: EM requires the normalized prediction to match a gold answer exactly, while F1 gives partial credit for token-level overlap with near-correct answers. The evaluation framework includes a public leaderboard that ranks submissions by F1 score, enabling direct comparison of model architectures. The metric computation handles edge cases such as multiple valid answer spans, whitespace normalization, and article/punctuation stripping through a reference implementation that all submissions must use.
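A simplified sketch of that normalization and scoring logic, paraphrased from the behavior of the official evaluation script rather than copied from it (the real script additionally takes the maximum score over all gold answers and applies a tunable no-answer threshold):

```python
import re
import string
from collections import Counter

def normalize_answer(s: str) -> str:
    # Lowercase, strip punctuation, drop English articles, collapse whitespace.
    s = s.lower()
    s = "".join(ch for ch in s if ch not in set(string.punctuation))
    s = re.sub(r"\b(a|an|the)\b", " ", s)
    return " ".join(s.split())

def exact_match(prediction: str, gold: str) -> int:
    return int(normalize_answer(prediction) == normalize_answer(gold))

def f1_score(prediction: str, gold: str) -> float:
    pred_tokens = normalize_answer(prediction).split()
    gold_tokens = normalize_answer(gold).split()
    # For unanswerable questions the gold answer is empty: full credit only if
    # the model also predicts "no answer".
    if not pred_tokens or not gold_tokens:
        return float(pred_tokens == gold_tokens)
    common = Counter(pred_tokens) & Counter(gold_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```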
Implements a reference evaluation script that handles token-level F1 computation with careful normalization (article/punctuation removal, whitespace handling) and supports both answerable and unanswerable question evaluation in a single framework. The leaderboard infrastructure provides transparent ranking with submission history and model card integration, enabling reproducible comparisons across years of research.
SQuAD 2.0's evaluation is more rigorous than earlier QA benchmarks because it includes answerability evaluation (not just EM/F1 for answerable questions) and the public leaderboard provides transparent ranking that has driven reproducible progress in the field, unlike proprietary benchmarks with hidden test sets.
crowdworker-annotated question generation on wikipedia passages
Medium confidence · SQuAD 2.0 uses a two-stage crowdsourcing pipeline where workers first read Wikipedia passages and generate natural language questions, then a second group of workers validates and labels whether each question is answerable from the passage. The dataset captures 150,000 human-written questions, with explicit span annotations marking where the answer appears in the passage for answerable questions, creating a human-quality gold standard. This approach ensures questions are naturally phrased and grounded in real text rather than template-generated or synthetic.
Implements a two-stage crowdsourcing pipeline where question generation and answerability validation are separated, reducing worker bias and enabling explicit annotation of unanswerable questions. Uses Wikipedia as the source domain because it provides diverse, well-structured passages with clear topic boundaries, and its permissive Creative Commons licensing enables open release of the dataset.
SQuAD 2.0's annotation methodology is more rigorous than earlier QA datasets because it includes a dedicated validation stage for answerability and uses real Wikipedia passages rather than synthetic or template-generated text, resulting in higher-quality and more realistic questions.
multi-model training and evaluation framework for transformer architectures
Medium confidence · SQuAD 2.0 serves as the primary benchmark that drove development and evaluation of BERT, RoBERTa, ALBERT, ELECTRA, and subsequent transformer models. The dataset is integrated into standard NLP libraries (Hugging Face Transformers, PyTorch Lightning) with pre-built training scripts and fine-tuning examples. Models can be evaluated end-to-end by loading the dataset, fine-tuning on the training split, and submitting predictions to the leaderboard, enabling rapid iteration on architecture and hyperparameter choices.
SQuAD 2.0 is deeply integrated into the Hugging Face Transformers ecosystem with official fine-tuning examples, pre-built training scripts, and model cards that document performance on the benchmark. This integration enables one-command fine-tuning and leaderboard submission, lowering the barrier to entry for researchers and practitioners.
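A brief sketch of that workflow, assuming the Hugging Face `datasets` and `transformers` packages are installed; `squad_v2` is the dataset id on the Hub, and the model id below is just one publicly available checkpoint fine-tuned on SQuAD 2.0, not the only option:

```python
from datasets import load_dataset
from transformers import pipeline

# Load the SQuAD 2.0 splits from the Hugging Face Hub.
squad_v2 = load_dataset("squad_v2")   # train (~130K) / validation (~12K) examples

# Run an off-the-shelf SQuAD 2.0 checkpoint; handle_impossible_answer lets the
# pipeline return an empty answer when the question looks unanswerable.
qa = pipeline("question-answering", model="deepset/roberta-base-squad2")

example = squad_v2["validation"][0]
prediction = qa(
    question=example["question"],
    context=example["context"],
    handle_impossible_answer=True,
)
print(prediction)   # {'score': ..., 'start': ..., 'end': ..., 'answer': ...}
```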
SQuAD 2.0 has driven more transformer model development than any other QA benchmark because it is the de facto standard for evaluating reading comprehension, has a transparent public leaderboard that incentivizes publication, and is tightly integrated into popular NLP libraries, making it easier to use than proprietary or less-integrated benchmarks.
adversarial question generation and answerability classification
Medium confidence · SQuAD 2.0 includes 50,000 unanswerable questions (33% of the dataset) that are adversarially constructed to be syntactically similar to answerable questions but lack a valid answer in the passage. These questions are generated by crowdworkers who read answerable questions and passages, then write new questions that look like they should be answerable but are not. Models must learn to classify whether a question is answerable (binary classification) in addition to extracting the answer span, requiring genuine reading comprehension rather than surface-level matching.
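One common way to add that answerability decision to a span-extraction model is the "null score" comparison popularized by BERT-style systems: score the option of predicting no answer (start and end both at the [CLS] position) against the best real span, and abstain when the gap exceeds a threshold tuned on the dev set. A rough illustration, with names, the brute-force span search, and the default threshold chosen for clarity rather than taken from any particular implementation:

```python
def predict_with_no_answer(start_logits, end_logits, null_threshold=0.0):
    # Score of abstaining: start and end both pointing at the [CLS] token (index 0).
    null_score = start_logits[0] + end_logits[0]

    # Best non-null span score (brute force over start <= end pairs; real
    # implementations cap span length and only search the top-k logits).
    best_span, best_score = None, float("-inf")
    for i in range(1, len(start_logits)):
        for j in range(i, len(end_logits)):
            score = start_logits[i] + end_logits[j]
            if score > best_score:
                best_span, best_score = (i, j), score

    # Abstain when the null score beats the best span by more than the
    # threshold, which is typically tuned on the dev set to maximize F1.
    if null_score - best_score > null_threshold:
        return None   # predict "no answer"
    return best_span
```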
SQuAD 2.0's adversarial unanswerable questions are human-generated rather than rule-based or synthetic, making them more realistic and harder to game. The annotation process explicitly separates question generation from answerability validation, ensuring that unanswerable questions are plausible and not obviously wrong, forcing models to perform genuine reading comprehension.
SQuAD 2.0's adversarial evaluation is more challenging than SQuAD 1.1 or other extractive QA benchmarks because it requires models to both extract answers and recognize when no answer exists, preventing models from achieving high performance through simple pattern matching or always-extract strategies.
domain-specific qa dataset construction methodology
Medium confidence · SQuAD 2.0 establishes a replicable methodology for constructing large-scale QA datasets: (1) select a source domain (Wikipedia), (2) crowdsource question generation on passages, (3) validate answerability with second-stage annotation, (4) compute inter-annotator agreement, (5) release with standardized evaluation metrics. Variants of this SQuAD-style blueprint have been used to build QA datasets in other domains (NewsQA, TriviaQA, HotpotQA) and languages (Chinese, German, French). Teams can follow the same blueprint to build domain-specific QA datasets of similar quality and scale.
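For step (4), one simple agreement measure a team replicating this blueprint might compute is Cohen's kappa over binary answerability labels from two annotators. This is a generic sketch, not the specific statistic reported for SQuAD 2.0 (the paper estimates human performance via EM/F1 against held-out gold answers):

```python
def cohens_kappa(labels_a, labels_b):
    # Chance-corrected agreement between two annotators' label sequences.
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement under independence, summed over label values.
    expected = 0.0
    for value in set(labels_a) | set(labels_b):
        expected += (labels_a.count(value) / n) * (labels_b.count(value) / n)
    return (observed - expected) / (1 - expected) if expected < 1 else 1.0

# Example: 1 = answerable, 0 = unanswerable
print(cohens_kappa([1, 1, 0, 1, 0, 0], [1, 0, 0, 1, 0, 1]))   # ~0.33
```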
SQuAD 2.0 establishes a two-stage crowdsourcing methodology with explicit validation of answerability, which has become the de facto standard for QA dataset construction. The published methodology includes detailed annotation guidelines, quality control procedures, and inter-annotator agreement metrics, enabling reproducible dataset construction in new domains and languages.
SQuAD 2.0's methodology is more rigorous than earlier QA dataset construction approaches because it includes a dedicated validation stage for answerability, publishes detailed annotation guidelines and quality metrics, and has been successfully replicated in multiple domains and languages, demonstrating its generalizability.
Capabilities are decomposed by AI analysis. Each maps to specific user intents and improves with match feedback.
Related Artifacts · sharing capabilities
Artifacts that share capabilities with SQuAD 2.0, ranked by overlap. Discovered automatically through the match graph.
Natural Questions
307K real Google Search queries answered from Wikipedia.
tinyroberta-squad2
question-answering model. 144,130 downloads.
minilm-uncased-squad2
question-answering model. 33,041 downloads.
bert-base-cased-squad2
question-answering model. 54,241 downloads.
ai2_arc
Dataset by allenai. 406,798 downloads.
roberta-large-squad2
question-answering model. 240,125 downloads.
Best For
- ✓ NLP researchers developing reading comprehension models
- ✓ Teams building production QA systems that need to handle unanswerable queries
- ✓ ML engineers benchmarking transformer model performance
- ✓ Academic institutions teaching information extraction and NLP fundamentals
- ✓ Researchers publishing QA model papers requiring standardized benchmarking
- ✓ ML teams comparing internal model variants against published baselines
- ✓ Leaderboard participants seeking transparent ranking and performance attribution
- ✓ Practitioners validating that their QA systems meet minimum performance thresholds
Known Limitations
- ⚠ Limited to English language only — no multilingual variants in base dataset
- ⚠ Extractive-only paradigm — cannot evaluate abstractive summarization or paraphrased answers
- ⚠ Wikipedia-domain bias — performance may not transfer to technical documentation, legal text, or domain-specific corpora
- ⚠ Static benchmark — no temporal evaluation of how model performance degrades on out-of-distribution questions
- ⚠ Crowdworker annotation artifacts — some questions may contain ambiguities or multiple valid answers not captured in a single gold span
- ⚠ EM metric is brittle — single character differences result in zero credit, not reflecting partial understanding
UnfragileRank
UnfragileRank is computed from adoption signals, documentation quality, ecosystem connectivity, match graph feedback, and freshness. No artifact can pay for a higher rank.
About
Stanford's reading comprehension benchmark containing 150,000 questions posed by crowdworkers on Wikipedia articles. SQuAD 2.0 adds 50,000 unanswerable questions that look similar to answerable ones, requiring models to know when they cannot answer from the given context. The foundational benchmark for extractive question answering that drove the development of BERT, RoBERTa, and subsequent pre-trained models. Human F1 score is 89.5; models now exceed this on the leaderboard.
Alternatives to SQuAD 2.0
Hugging Face · The GitHub for AI — 500K+ models, datasets, Spaces, Inference API, hub for open-source AI.