{"passport":{"unfragile":{"@version":"1.0","version":"2026-05","artifact":{"id":"hellaswag","slug":"hellaswag","name":"HellaSwag","type":"dataset","url":"https://huggingface.co/datasets/Rowan/hellaswag","page_url":"https://unfragile.ai/hellaswag","categories":["testing-quality"],"tags":[],"pricing":{"model":"free","free":true,"starting_price":null},"status":"active","verified":false},"capabilities":[{"id":"hellaswag__cap_0","uri":"capability://data.processing.analysis.adversarial.filtered.multiple.choice.evaluation","name":"adversarial-filtered multiple-choice evaluation","description":"Evaluates language models on 70,000 multiple-choice questions where incorrect options were generated by language models and adversarially selected to fool machines while remaining obviously wrong to humans. The filtering process uses a two-stage approach: LLM-generated distractors are ranked by their ability to confuse models (measured via model accuracy on that specific question), then human annotators validate that the hard-for-models options remain easy for humans. This creates a dataset where model performance gaps vs human performance (95.6% human accuracy) directly measure commonsense reasoning gaps rather than dataset artifacts.","intents":["Benchmark a language model's commonsense reasoning ability against a human-calibrated baseline","Identify whether model failures are due to lack of world knowledge or adversarial confusion","Track progress on commonsense understanding as models improve over time","Evaluate if a model has learned genuine reasoning or is exploiting dataset biases"],"best_for":["LLM researchers evaluating frontier models on commonsense tasks","Teams building reasoning-heavy applications who need diagnostic benchmarks","Model developers tracking regression on human-aligned understanding"],"limitations":["Multiple-choice format doesn't test open-ended generation or explanation quality","Adversarial filtering is computationally expensive and may not catch all model-specific failure modes","Dataset is English-only; cross-lingual commonsense reasoning requires separate evaluation","70,000 examples may show saturation effects for frontier models approaching human performance"],"requires":["Language model with inference capability (API access or local deployment)","Ability to parse JSON-formatted multiple-choice questions","Evaluation harness to compute accuracy metrics and confidence intervals"],"input_types":["text (scenario context)","text (multiple-choice options A-D)"],"output_types":["structured data (accuracy score per model)","structured data (per-question model predictions and confidence)"],"categories":["data-processing-analysis","benchmark-evaluation"],"confidence":0.5,"matches":0,"success_rate":0},{"id":"hellaswag__cap_1","uri":"capability://data.processing.analysis.physical.commonsense.continuation.prediction","name":"physical commonsense continuation prediction","description":"Tests models' ability to predict the next action or outcome in video-like scenarios involving physical activities (cooking, sports, repairs, etc.). Each question presents a sequence of events and asks which of four options most plausibly continues the sequence. The dataset uses real-world video captions and activities, grounding commonsense in concrete physical interactions rather than abstract reasoning. Models must understand object physics, tool usage, body mechanics, and temporal causality to select correct continuations.","intents":["Evaluate whether a model understands physical causality and object interactions","Test if a model can reason about multi-step procedures and their outcomes","Measure a model's ability to predict human actions in real-world contexts"],"best_for":["Robotics teams evaluating if language models can reason about physical tasks","Video understanding researchers benchmarking temporal reasoning","Embodied AI developers testing if models understand action consequences"],"limitations":["Text-only format loses visual information that humans use for physical reasoning","Scenarios are limited to common activities; rare or specialized physical tasks are underrepresented","No feedback on which physical principles the model failed to apply"],"requires":["Language model capable of multi-sentence context understanding","Access to dataset splits (train/validation/test)"],"input_types":["text (scenario description with sequential actions)"],"output_types":["categorical (selected option A/B/C/D)","structured data (accuracy on physical reasoning subset)"],"categories":["data-processing-analysis","planning-reasoning"],"confidence":0.5,"matches":0,"success_rate":0},{"id":"hellaswag__cap_2","uri":"capability://data.processing.analysis.social.and.temporal.reasoning.evaluation","name":"social and temporal reasoning evaluation","description":"Assesses models' understanding of social dynamics, conversational context, and temporal sequences in everyday scenarios. Questions test whether models can reason about social norms (what's appropriate to say/do), emotional reactions, and cause-effect relationships across time. The dataset includes scenarios involving interpersonal interactions, social etiquette, and temporal ordering of events. Adversarial distractors specifically target models that misunderstand social context or temporal logic while remaining obviously wrong to humans.","intents":["Evaluate if a model understands social norms and appropriate behavior","Test temporal reasoning: can the model order events causally and understand time progression","Measure if a model can predict human emotional or social reactions"],"best_for":["Conversational AI teams building socially-aware chatbots","Researchers studying if LLMs have learned social understanding or are pattern-matching","Teams building dialogue systems that need to reason about social context"],"limitations":["Social reasoning is culturally dependent; dataset is primarily Western/English-centric","Multiple-choice format doesn't test nuanced social judgment or ethical reasoning","No distinction between models that memorized social patterns vs. those that understand underlying principles"],"requires":["Language model with context understanding","Ability to parse social scenario descriptions"],"input_types":["text (social scenario or dialogue context)"],"output_types":["categorical (selected option A/B/C/D)","structured data (accuracy on social reasoning subset)"],"categories":["data-processing-analysis","planning-reasoning"],"confidence":0.5,"matches":0,"success_rate":0},{"id":"hellaswag__cap_3","uri":"capability://data.processing.analysis.machine.vs.human.performance.gap.analysis","name":"machine-vs-human performance gap analysis","description":"Provides a calibrated benchmark where human accuracy (95.6%) is known and adversarial filtering ensures that questions hard for machines remain easy for humans. This enables precise measurement of the performance gap between models and humans on commonsense reasoning. Researchers can use this gap to quantify progress toward human-level understanding and identify which types of commonsense reasoning (physical, social, temporal) show the largest model-human gaps.","intents":["Measure how close a model is to human-level commonsense reasoning","Identify which commonsense reasoning categories (physical, social, temporal) show the largest model-human gaps","Track whether model improvements are closing the commonsense gap or just improving on easier questions"],"best_for":["LLM researchers publishing benchmarking papers and tracking frontier model progress","Teams making go/no-go decisions on whether models are ready for production reasoning tasks","Researchers studying the nature of commonsense reasoning gaps in current models"],"limitations":["95.6% human accuracy is an aggregate; individual question difficulty varies and some may be genuinely ambiguous","Gap analysis assumes human accuracy is the correct target; some questions may have multiple valid answers","Frontier models now approach human performance, reducing the discriminative power of the benchmark for future models"],"requires":["Model evaluation harness that computes accuracy and confidence intervals","Statistical tools to compute gap analysis and per-category breakdowns"],"input_types":["model predictions (selected options and confidence scores)"],"output_types":["structured data (accuracy, human-model gap, per-category performance)","structured data (confidence intervals and statistical significance)"],"categories":["data-processing-analysis","benchmark-evaluation"],"confidence":0.5,"matches":0,"success_rate":0},{"id":"hellaswag__cap_4","uri":"capability://data.processing.analysis.dataset.versioning.and.reproducibility","name":"dataset versioning and reproducibility","description":"Provides a fixed, versioned dataset of 70,000 examples with consistent train/validation/test splits, enabling reproducible evaluation across models and time. The dataset is hosted on Hugging Face with version control, allowing researchers to cite specific versions and ensuring that benchmark results are comparable across papers. The fixed nature of the dataset (no dynamic generation or augmentation) means that model improvements reflect genuine capability gains rather than dataset variance.","intents":["Ensure that benchmark results are reproducible and comparable across different papers and teams","Track model progress over time by comparing results on the same fixed dataset","Enable fair comparison between models by using identical evaluation data"],"best_for":["Researchers publishing benchmarking papers who need reproducible, citable datasets","Teams tracking model progress over time and need a stable baseline","Leaderboard maintainers who need a fixed evaluation set"],"limitations":["Fixed dataset may become saturated as frontier models approach human performance","No dynamic augmentation or adversarial examples generated at evaluation time","Dataset reflects snapshot of commonsense at time of creation; evolving social norms or new activities may not be represented"],"requires":["Hugging Face account or API access to download dataset","Ability to parse JSON-formatted dataset files"],"input_types":["none (dataset is provided)"],"output_types":["structured data (JSON with scenario, options, label, and metadata)"],"categories":["data-processing-analysis","automation-workflow"],"confidence":0.5,"matches":0,"success_rate":0},{"id":"hellaswag__headline","uri":"capability://testing.quality.commonsense.reasoning.benchmark.dataset","name":"commonsense reasoning benchmark dataset","description":"A comprehensive dataset designed for evaluating models on commonsense reasoning through 70,000 multiple-choice questions that challenge their understanding of everyday scenarios and human-like reasoning.","intents":["best commonsense reasoning dataset","dataset for testing model understanding of everyday scenarios","commonsense reasoning benchmark for AI evaluation","top datasets for commonsense reasoning tasks","free datasets for commonsense reasoning"],"best_for":["AI researchers","developers testing LLMs"],"limitations":[],"requires":[],"input_types":[],"output_types":[],"categories":["testing-quality"],"confidence":0.5,"matches":0,"success_rate":0}],"trust":{"score":56,"verified":false,"data_access_risk":"low","permissions":["Language model with inference capability (API access or local deployment)","Ability to parse JSON-formatted multiple-choice questions","Evaluation harness to compute accuracy metrics and confidence intervals","Language model capable of multi-sentence context understanding","Access to dataset splits (train/validation/test)","Language model with context understanding","Ability to parse social scenario descriptions","Model evaluation harness that computes accuracy and confidence intervals","Statistical tools to compute gap analysis and per-category breakdowns","Hugging Face account or API access to download dataset"],"failure_modes":["Multiple-choice format doesn't test open-ended generation or explanation quality","Adversarial filtering is computationally expensive and may not catch all model-specific failure modes","Dataset is English-only; cross-lingual commonsense reasoning requires separate evaluation","70,000 examples may show saturation effects for frontier models approaching human performance","Text-only format loses visual information that humans use for physical reasoning","Scenarios are limited to common activities; rare or specialized physical tasks are underrepresented","No feedback on which physical principles the model failed to apply","Social reasoning is culturally dependent; dataset is primarily Western/English-centric","Multiple-choice format doesn't test nuanced social judgment or ethical reasoning","No distinction between models that memorized social patterns vs. those that understand underlying principles","builder identity is not verified yet","no observed match outcomes yet"],"rank_breakdown":{"adoption":0.7,"quality":0.8500000000000001,"ecosystem":0.3,"match_graph":0.25,"freshness":0.75,"weights":{"adoption":0.3,"quality":0.25,"ecosystem":0.1,"match_graph":0.3,"freshness":0.05}},"observed_outcomes":{"matches":0,"success_rate":0,"avg_confidence":0,"top_intents":[],"last_matched_at":null},"maintenance":{"status":"active","updated_at":"2026-05-24T12:16:22.066Z","last_scraped_at":null,"last_commit":null},"community":{"stars":null,"forks":null,"weekly_downloads":null,"model_downloads":null,"model_likes":null}},"distribution":{"claim_url":"https://unfragile.ai/submit?claim=hellaswag","compare_url":"https://unfragile.ai/compare?artifact=hellaswag"}},"signature":"L7RlUtID4Wo21EEHBqKKxOGitrZa490kWdThNdRux5F1aoGWeDYj7nr2+EjQAiXGwB9fryS1Rze5C5v2H/HpCg==","signedAt":"2026-06-21T08:58:08.749Z","signedBy":"unfragile.ai","version":1},"_links":{"self":"https://unfragile.ai/api/v1/passport/hellaswag","artifact":"https://unfragile.ai/hellaswag","verify":"https://unfragile.ai/api/v1/verify?slug=hellaswag","publicKey":"https://unfragile.ai/api/v1/trust-passport-public-key","spec":"https://unfragile.ai/trust","schema":"https://unfragile.ai/schema.json","docs":"https://unfragile.ai/docs"}}