Natural Questions vs cua — Comparison | Unfragile

Natural Questions vs cua

Side-by-side comparison to help you choose.

Natural Questions

Dataset

/ 100

Free

cua

Agent

/ 100

Free

Feature	Natural Questions	cua
Type	Dataset	Agent
UnfragileRank	48/100	53/100
Adoption	1	1
Quality	0	1
Ecosystem	0

Natural Questions Capabilities

open-domain question answering evaluation with retrieval + comprehension

Evaluates end-to-end QA systems by requiring models to both retrieve relevant Wikipedia passages from 5.9M articles and extract answers from those passages. Unlike single-document QA benchmarks, Natural Questions forces systems to solve the full information retrieval pipeline before reading comprehension, using real Google Search queries as ground truth for relevance. Annotators provide both paragraph-level (long answer) and entity-level (short answer) labels, enabling fine-grained performance measurement across retrieval and extraction stages.

Unique: Combines retrieval and reading comprehension in a single benchmark using real Google Search queries, forcing systems to solve the full open-domain QA pipeline rather than isolated reading comprehension on pre-selected passages. The dual-annotation scheme (long + short answers) enables separate measurement of retrieval quality and extraction accuracy.

vs alternatives: More realistic than SQuAD (which provides passage context) because it requires actual retrieval; more comprehensive than MS MARCO (which focuses on ranking) because it evaluates end-to-end answer extraction from retrieved passages

dual-level answer annotation and span extraction

Provides two complementary answer labels per question: long answers (full paragraph from Wikipedia containing the answer) and short answers (minimal entity or phrase). This dual-level annotation enables training and evaluating both passage-ranking and span-extraction components separately. Annotators mark questions as unanswerable if no Wikipedia article contains the answer, creating a realistic distribution of answerable vs. unanswerable queries matching production search logs.

Unique: Dual-level annotation (paragraph + entity) decouples retrieval evaluation from reading comprehension, allowing separate optimization of passage ranking and span extraction. The explicit unanswerable label distribution reflects real search query distributions rather than assuming all questions have answers.

vs alternatives: More granular than SQuAD's single-span annotation because it separates passage retrieval from answer extraction; more realistic than MS MARCO because it includes explicit unanswerable examples matching production query distributions

real-world query distribution from google search logs

Dataset contains 307,373 real, anonymized queries extracted from Google Search logs, ensuring the question distribution reflects actual user information needs rather than synthetic or crowdsourced questions. This ground-truth distribution includes long-tail queries, ambiguous questions, and unanswerable searches that production systems must handle. Pairing these queries with Wikipedia articles creates a realistic open-domain QA evaluation setting where systems must handle the full diversity of real user intent.

Unique: Uses real Google Search queries rather than crowdsourced or synthetic questions, capturing the true distribution of user information needs including long-tail, ambiguous, and unanswerable searches. This grounds evaluation in production-grade query patterns rather than benchmark-specific biases.

vs alternatives: More representative of real user intent than SQuAD or MS MARCO because it derives from actual search logs; captures natural query diversity and ambiguity that synthetic benchmarks cannot replicate

wikipedia corpus-based passage retrieval evaluation

Provides a fixed corpus of 5.9M Wikipedia articles as the knowledge base for retrieval evaluation. Systems must rank and retrieve relevant articles/passages from this corpus to answer questions, enabling measurement of retrieval quality (recall@k, MRR) independent of reading comprehension. The corpus is structured with article-level and paragraph-level granularity, allowing evaluation of both coarse document retrieval and fine-grained passage ranking. This setup forces realistic retrieval challenges: handling polysemy, disambiguation, and ranking relevant passages above irrelevant ones from the same article.

Unique: Provides a large, fixed Wikipedia corpus (5.9M articles) with paragraph-level granularity, enabling evaluation of both document-level and passage-level retrieval. The corpus size and diversity force systems to handle realistic retrieval challenges like disambiguation and ranking relevant passages above irrelevant ones from the same article.

vs alternatives: Larger and more diverse than MS MARCO's passage corpus because it covers all of Wikipedia; more realistic than SQuAD because it requires actual retrieval rather than providing context upfront

answerability classification and unanswerable query handling

Explicitly labels ~20% of questions as unanswerable (no Wikipedia article contains the answer), enabling evaluation of systems' ability to recognize when they cannot answer a question rather than hallucinating. This answerability classification is crucial for production systems that must gracefully handle out-of-domain or factually impossible queries. The distribution of answerable vs. unanswerable questions reflects real search query patterns, not synthetic balanced datasets.

Unique: Explicitly includes unanswerable questions (~20%) with ground-truth labels, enabling direct evaluation of systems' ability to recognize when they cannot answer. This reflects real query distributions where many searches have no valid answer in any single knowledge base.

vs alternatives: More realistic than SQuAD or MS MARCO because it includes explicit unanswerable examples; forces systems to avoid hallucination rather than assuming all questions have answers

multi-stage qa pipeline training and evaluation

Enables training and evaluating modular QA systems with separate retrieval and reading comprehension stages. The dataset structure (questions paired with Wikipedia corpus and dual-level answer annotations) supports training a dense retriever on passage relevance, a reader on span extraction, and an answerability classifier on unanswerable queries. Evaluation can measure each stage independently (retrieval recall, reader F1, answerability accuracy) or end-to-end (final answer accuracy), enabling fine-grained performance analysis and bottleneck identification.

Unique: Dataset structure explicitly supports training and evaluating modular QA pipelines with separate retrieval and reading comprehension stages. Dual-level annotations (long + short answers) and answerability labels enable independent optimization and evaluation of each component.

vs alternatives: More suitable for modular pipeline training than end-to-end QA datasets because it provides both passage-level and answer-level labels; enables separate measurement of retrieval and comprehension unlike single-stage QA benchmarks

cua Capabilities

vision-language model-driven screenshot interpretation and action reasoning

Captures desktop screenshots and feeds them to 100+ integrated vision-language models (Claude, GPT-4V, Gemini, local models via adapters) to reason about UI state and determine appropriate next actions. Uses a unified message format (Responses API) across heterogeneous model providers, enabling the agent to understand visual context and generate structured action commands without brittle selector-based logic.

Unique: Implements a unified Responses API message format abstraction layer that normalizes outputs from 100+ heterogeneous VLM providers (native computer-use models like Claude, composed models via grounding adapters, and local model adapters), eliminating provider-specific parsing logic and enabling seamless model swapping without agent code changes.

vs alternatives: Broader model coverage and provider flexibility than Anthropic's native computer-use API alone, with explicit support for local/open-source models and a standardized message format that decouples agent logic from model implementation details.

multi-os sandboxed execution environment provisioning and lifecycle management

Provisions isolated execution environments across macOS (via Lume VMs), Linux (Docker), Windows (Windows Sandbox), and host OS, with unified provider abstraction. Handles VM/container lifecycle (creation, snapshot management, cleanup), resource allocation, and OS-specific action handlers (keyboard/mouse events, clipboard, file system access) through a pluggable provider architecture that abstracts platform differences.

Unique: Implements a pluggable provider architecture with unified Computer interface that abstracts OS-specific action handlers (macOS native events via Lume, Linux X11/Wayland via Docker, Windows input simulation via Windows Sandbox API), enabling single agent code to target multiple platforms. Includes Lume VM management with snapshot/restore capabilities for deterministic testing.

vs alternatives: More comprehensive OS coverage than single-platform solutions; Lume provider offers native macOS VM support with snapshot capabilities unavailable in Docker-only alternatives, while unified provider abstraction reduces code duplication vs. platform-specific agent implementations.

Natural Questions vs cua

Natural Questions Capabilities

cua Capabilities

Verdict

Company