ai2_arc
Free dataset by allenai. 406,798 downloads.
Capabilities (6 decomposed)
multiple-choice question-answering dataset curation
Medium confidence: Provides a curated collection of 7,787 multiple-choice science questions, split into a Challenge set (2,590 questions) and an Easy set (5,197 questions), sourced from real educational assessments and standardized tests. Each record carries question text, the answer options (typically four), and a ground-truth answer key, enabling direct training and evaluation of QA models on grade-school science reasoning tasks without requiring annotation from scratch.
Combines two partitions of real standardized-test questions with explicit difficulty stratification: the Challenge set holds questions that defeat both retrieval-based and word co-occurrence baselines, while the Easy set holds the remainder, enabling controlled evaluation across reasoning difficulty levels rather than relying on synthetic generation
More reasoning-focused than SQuAD (extractive QA only) and more grounded in real educational assessments than RACE, making it better suited for evaluating reasoning-heavy multiple-choice understanding
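For illustration, a minimal sketch of loading one partition and inspecting a record, assuming the standard `allenai/ai2_arc` configs on the Hub and the usual `question`/`choices`/`answerKey` schema:

```python
from datasets import load_dataset

# "ARC-Challenge" and "ARC-Easy" are the two configs of this dataset.
arc = load_dataset("allenai/ai2_arc", "ARC-Challenge")

example = arc["train"][0]
print(example["question"])          # question text
print(example["choices"]["label"])  # option labels, e.g. ["A", "B", "C", "D"]
print(example["choices"]["text"])   # option texts
print(example["answerKey"])         # ground-truth label, e.g. "C"
```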
parquet-based dataset streaming and lazy loading
Medium confidence: Implements efficient columnar storage via the Apache Parquet format with HuggingFace Datasets library integration, enabling lazy row-level access without loading the full 7,787-question corpus into memory. The streaming architecture supports batch iteration, random sampling, and train/test split management through the datasets library's memory-mapped file handling and automatic caching mechanisms.
Leverages HuggingFace Datasets' memory-mapped Parquet backend with automatic split management (train/test/validation) and built-in caching, avoiding manual file I/O and enabling seamless integration with PyTorch DataLoader and TensorFlow tf.data pipelines
More memory-efficient than CSV-based datasets (columnar compression) and simpler than custom HDF5 implementations while maintaining compatibility with standard ML training frameworks
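A sketch of the lazy-access pattern described above, assuming this dataset supports `streaming=True` like other Parquet-backed Hub datasets:

```python
from datasets import load_dataset

# streaming=True returns an IterableDataset: rows are fetched and
# deserialized on demand instead of materializing the corpus in RAM.
stream = load_dataset(
    "allenai/ai2_arc", "ARC-Easy", split="train", streaming=True
)

for example in stream.take(8):  # pull only the first 8 rows
    print(example["question"][:60])
```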
train-test split stratification and benchmark reproducibility
Medium confidence: Provides pre-defined train/validation/test splits (Challenge set: 1,119 train, 299 validation, and 1,172 test questions; Easy set: 2,251, 570, and 2,376), shipped as fixed files rather than regenerated by random sampling, ensuring reproducible model evaluation across research teams. The split structure enables fair comparison of model architectures by controlling for data leakage and maintaining consistent evaluation protocols across published benchmarks.
Combines difficulty-stratified partitions (Easy vs. Challenge) with fixed, published splits, enabling both broad evaluation and targeted assessment of model reasoning on harder questions while keeping every run deterministic and reproducible
More rigorous than ad-hoc 80/20 splits by explicitly controlling for difficulty distribution and providing a separate challenge benchmark, similar to GLUE but with science-domain specificity
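A short sketch that makes the fixed splits visible, assuming the `ARC-Challenge` config exposes the standard `train`/`validation`/`test` keys:

```python
from datasets import load_dataset

challenge = load_dataset("allenai/ai2_arc", "ARC-Challenge")

# Splits ship as fixed files, so every run sees identical partitions.
for name, split in challenge.items():
    print(name, len(split))
# expected: train 1119, validation 299, test 1172
```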
cross-framework dataset compatibility and format export
Medium confidence: Supports seamless integration with multiple data processing ecosystems (pandas DataFrames, polars, MLCroissant metadata format) and export to standard formats (CSV, JSON, Parquet), enabling interoperability across PyTorch, TensorFlow, scikit-learn, and custom training pipelines. The HuggingFace Datasets library abstraction handles format conversion automatically, removing friction from data pipeline construction.
Provides native integration with HuggingFace Datasets library's format abstraction layer, enabling single-line conversions to pandas/polars/CSV/JSON while maintaining metadata through MLCroissant standard, rather than requiring manual serialization code
More flexible than raw parquet files (which require custom deserialization) and simpler than building custom ETL pipelines, with automatic handling of schema preservation across format conversions
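A sketch of those single-line conversions using export methods the datasets library provides (`to_pandas`, `to_csv`, `to_json`); the output filenames here are arbitrary:

```python
from datasets import load_dataset

split = load_dataset("allenai/ai2_arc", "ARC-Easy", split="validation")

df = split.to_pandas()               # in-memory pandas DataFrame
split.to_csv("arc_easy_val.csv")     # flat CSV export
split.to_json("arc_easy_val.jsonl")  # JSON Lines export
# split.to_polars() is also available in recent datasets releases
```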
open-domain question-answering evaluation framework
Medium confidence: Enables evaluation of open-domain QA systems (not just multiple-choice) by providing ground-truth answer labels that can be compared against model predictions using standard metrics (exact match, F1 score, BLEU). The dataset structure supports both extractive QA evaluation (matching answer spans) and generative QA evaluation (comparing predicted text to reference answers), making it suitable for benchmarking diverse QA architectures.
Provides ground-truth labels for both multiple-choice classification and open-domain QA evaluation, enabling researchers to benchmark models that generate free-form answers by comparing predictions to the correct option text, rather than limiting evaluation to multiple-choice accuracy
More versatile than SQuAD (extractive-only) for evaluating generative QA, and more rigorous than RACE by including explicit difficulty stratification and sourcing from real standardized assessments
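One possible scoring helper for free-form predictions, assuming the `choices`/`answerKey` schema shown earlier; `normalize`, `gold_answer_text`, and `exact_match` are illustrative helpers, not part of the datasets API:

```python
import re

def normalize(text: str) -> str:
    """Lowercase, drop punctuation, and collapse whitespace for lenient matching."""
    return " ".join(re.sub(r"[^a-z0-9 ]", " ", text.lower()).split())

def gold_answer_text(example: dict) -> str:
    """Map the answerKey label back to the text of the correct option."""
    idx = example["choices"]["label"].index(example["answerKey"])
    return example["choices"]["text"][idx]

def exact_match(prediction: str, example: dict) -> bool:
    """True when a free-form prediction matches the gold option text."""
    return normalize(prediction) == normalize(gold_answer_text(example))
```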
science-domain reasoning benchmark with difficulty tiers
Medium confidence: Organizes its 7,787 science questions into an Easy partition (5,197 questions) and a harder Challenge partition (2,590 questions drawn from the ARC competition), enabling targeted evaluation of model reasoning capabilities across complexity levels. The two-tier structure allows researchers to diagnose where models fail (e.g., passing Easy questions but struggling on Challenge) and to measure progress on increasingly difficult reasoning tasks without requiring manual difficulty annotation.
Stratifies difficulty by construction rather than by annotation: Challenge questions are those answered incorrectly by both a retrieval-based and a word co-occurrence baseline, providing both broad coverage of science questions and a curated set of particularly difficult questions for targeted reasoning evaluation
More granular than single-difficulty benchmarks like SQuAD, and more grounded in real educational assessments than synthetically-generated difficulty tiers, enabling precise diagnosis of model reasoning limitations
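A sketch of per-tier diagnosis under the same assumptions; `always_first` is a hypothetical trivial baseline standing in for a real model:

```python
from datasets import load_dataset

def always_first(example: dict) -> str:
    # Trivial baseline (hypothetical): always pick the first listed option.
    return example["choices"]["label"][0]

def accuracy(predict, config: str) -> float:
    """Score a predict(example) -> answer-label callable on one partition's test split."""
    test = load_dataset("allenai/ai2_arc", config, split="test")
    hits = sum(predict(ex) == ex["answerKey"] for ex in test)
    return hits / len(test)

# A gap between the two numbers localizes failures to harder reasoning
# rather than to science knowledge in general.
for config in ("ARC-Easy", "ARC-Challenge"):
    print(config, round(accuracy(always_first, config), 3))
```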
Capabilities are decomposed by AI analysis. Each maps to specific user intents and improves with match feedback.
Related Artifacts (sharing capabilities)
Artifacts that share capabilities with ai2_arc, ranked by overlap. Discovered automatically through the match graph.
medical-qa-shared-task-v1-toy
Dataset by lavita. 525,534 downloads.
OpenThoughts-1k-sample
Dataset by ryanmarten. 533,474 downloads.
SWE-bench_Verified
Dataset by princeton-nlp. 678,148 downloads.
mmlu
Dataset by cais. 439,045 downloads.
hellaswag
Dataset by Rowan. 302,975 downloads.
TrustLLM
8-dimension trustworthiness benchmark for LLMs.
Best For
- ✓ ML researchers evaluating QA model architectures
- ✓ Teams building educational AI tutoring systems
- ✓ Developers benchmarking LLM reasoning capabilities on science tasks
- ✓ ML engineers training models on resource-constrained hardware
- ✓ Researchers iterating rapidly on model architectures with large datasets
- ✓ Teams deploying models in production with strict memory budgets
- ✓ Researchers publishing QA model benchmarks requiring reproducibility
- ✓ Teams comparing multiple model architectures on identical test sets
Known Limitations
- ⚠ Limited to English-language science questions only; no multilingual coverage
- ⚠ Grade-school science focus may not generalize to advanced domain-specific QA
- ⚠ Fixed question set limits continuous evaluation; no dynamic question generation
- ⚠ No temporal metadata, and difficulty stratification stops at the coarse Easy/Challenge partition
- ⚠ Parquet format requires the datasets library; no native SQL query support
- ⚠ Lazy loading adds ~50-100ms per batch fetch due to deserialization overhead
UnfragileRank
UnfragileRank is computed from adoption signals, documentation quality, ecosystem connectivity, match graph feedback, and freshness. No artifact can pay for a higher rank.
About
ai2_arc, a dataset on HuggingFace with 406,798 downloads