{"passport":{"unfragile":{"@version":"1.0","version":"2026-05","artifact":{"id":"math","slug":"math","name":"MATH","type":"dataset","url":"https://huggingface.co/datasets/lighteval/MATH","page_url":"https://unfragile.ai/math","categories":["testing-quality"],"tags":[],"pricing":{"model":"free","free":true,"starting_price":null},"status":"active","verified":false},"capabilities":[{"id":"math__cap_0","uri":"capability://data.processing.analysis.competition.mathematics.problem.corpus.construction.and.curation","name":"competition-mathematics problem corpus construction and curation","description":"Aggregates 12,500 hand-curated competition mathematics problems sourced from AMC (American Mathematics Competitions), AIME (American Invitational Mathematics Examination), and other prestigious math olympiads. Problems are structured with metadata including difficulty ratings (1-5 scale), subject classification across 7 domains, and complete step-by-step solutions. The curation process filters for problems that require genuine mathematical reasoning rather than pattern matching, enabling reliable evaluation of model reasoning depth.","intents":["Evaluate whether an LLM can solve authentic competition-level math problems requiring multi-step reasoning","Benchmark model performance across specific mathematical domains (algebra, geometry, number theory) to identify capability gaps","Train reasoning models on problems with verified solutions to improve chain-of-thought and step-by-step problem solving","Compare model performance trajectories over time using a stable, difficulty-stratified benchmark"],"best_for":["AI researchers evaluating reasoning capabilities of large language models","Teams training specialized math-solving agents or tutoring systems","Organizations benchmarking model improvements across reasoning-heavy tasks","Researchers studying scaling laws and emergence of mathematical reasoning"],"limitations":["Dataset is static and finite (12,500 problems) — does not grow with new competition years after curation cutoff","Problems require symbolic/algebraic reasoning; limited coverage of applied mathematics or real-world problem contexts","Difficulty ratings are subjective and may not correlate uniformly with model performance across different architectures","Solutions are provided in natural language format, not machine-parseable structured representations, requiring custom parsing for automated evaluation","No built-in support for partial credit or intermediate step validation — evaluation is typically binary (correct final answer or not)"],"requires":["Hugging Face Datasets library (datasets>=2.0.0) or direct download access","Python 3.7+ for dataset loading and processing","Sufficient disk space (~500MB-1GB for full dataset with solutions)","LLM or reasoning model capable of generating multi-token mathematical expressions and symbolic reasoning"],"input_types":["problem statement (natural language text with mathematical notation)","optional: difficulty level filter (integer 1-5)","optional: subject category filter (string: Prealgebra, Algebra, Number Theory, Counting and Probability, Geometry, Intermediate Algebra, Precalculus)"],"output_types":["structured dataset records (problem text, solution steps, difficulty, subject, answer)","evaluation metrics (accuracy percentage, per-subject performance, difficulty-stratified scores)","model predictions (generated solution text, final numerical answer)"],"categories":["data-processing-analysis","testing-quality"],"confidence":0.5,"matches":0,"success_rate":0},{"id":"math__cap_1","uri":"capability://data.processing.analysis.difficulty.stratified.problem.sampling.and.filtering","name":"difficulty-stratified problem sampling and filtering","description":"Enables selective sampling of problems across a 5-level difficulty scale, allowing researchers to construct evaluation sets tailored to specific model capability ranges. The difficulty metadata is pre-assigned during curation, enabling efficient filtering without re-evaluation. This supports progressive evaluation strategies where models are first tested on easier problems (difficulty 1-2) before advancing to harder ones (difficulty 4-5), reducing computational waste on problems beyond a model's current capability.","intents":["Create evaluation subsets that match a model's expected capability level to avoid ceiling or floor effects","Perform difficulty-aware benchmarking to identify at what problem complexity a model's performance degrades","Construct progressive evaluation pipelines that stop testing when a model reaches a performance threshold","Analyze scaling laws by comparing model performance across difficulty levels as model size increases"],"best_for":["Researchers studying model scaling and emergence of reasoning capabilities","Teams iteratively improving math-solving models and needing targeted evaluation","Organizations with limited compute budgets wanting to prioritize evaluation on relevant difficulty ranges"],"limitations":["Difficulty ratings are subjective and assigned during curation — may not align with actual model-specific difficulty (e.g., a model trained on geometry may find geometry problems easier than assigned)","No dynamic difficulty adjustment based on model performance — filtering is static based on pre-assigned labels","Difficulty distribution across subjects may be uneven (e.g., more hard geometry problems than hard prealgebra problems)"],"requires":["Hugging Face Datasets library with filter/select functionality","Python 3.7+ for dataset manipulation","Knowledge of difficulty scale (1-5) and subject categories to construct meaningful subsets"],"input_types":["difficulty level range (integer 1-5 or subset)","optional: subject filter (string)","optional: sample size (integer)"],"output_types":["filtered dataset subset (problems matching difficulty/subject criteria)","difficulty distribution statistics (count of problems per difficulty level)"],"categories":["data-processing-analysis","testing-quality"],"confidence":0.5,"matches":0,"success_rate":0},{"id":"math__cap_2","uri":"capability://data.processing.analysis.subject.domain.problem.categorization.and.retrieval","name":"subject-domain problem categorization and retrieval","description":"Organizes 12,500 problems into 7 distinct mathematical subject categories (Prealgebra, Algebra, Number Theory, Counting and Probability, Geometry, Intermediate Algebra, Precalculus), enabling domain-specific evaluation and analysis. Each problem is tagged with its primary subject during curation, allowing researchers to isolate performance on specific mathematical domains and identify capability gaps (e.g., a model may excel at algebra but struggle with geometry). Supports both filtering and aggregation queries across subject boundaries.","intents":["Evaluate model performance on specific mathematical domains to identify domain-specific weaknesses","Train specialized models on particular subjects by filtering the dataset to domain-specific problems","Analyze whether models have balanced mathematical knowledge or show domain-specific biases","Construct balanced evaluation sets with equal representation from each mathematical subject"],"best_for":["Researchers analyzing domain-specific reasoning capabilities and identifying capability gaps","Teams training subject-specific tutoring or problem-solving systems","Organizations wanting to understand model performance across mathematical disciplines"],"limitations":["Problems are assigned to a single primary subject — does not capture multi-domain problems that require knowledge from multiple subjects","Subject taxonomy is fixed at 7 categories — no hierarchical organization (e.g., Algebra is not subdivided into Linear Algebra, Polynomial Algebra)","Subject distribution may be uneven (e.g., more Algebra problems than Geometry problems in the dataset)"],"requires":["Hugging Face Datasets library with filter/group_by functionality","Python 3.7+ for dataset manipulation","Familiarity with the 7 subject categories to construct meaningful queries"],"input_types":["subject category name (string: Prealgebra, Algebra, Number Theory, Counting and Probability, Geometry, Intermediate Algebra, Precalculus)","optional: multiple subjects (list of strings for multi-subject filtering)"],"output_types":["filtered dataset subset (problems in specified subject(s))","subject distribution statistics (count of problems per subject, percentage breakdown)","per-subject performance metrics (accuracy, average steps to solution)"],"categories":["data-processing-analysis","testing-quality"],"confidence":0.5,"matches":0,"success_rate":0},{"id":"math__cap_3","uri":"capability://data.processing.analysis.step.by.step.solution.annotation.and.verification","name":"step-by-step solution annotation and verification","description":"Each of the 12,500 problems includes detailed step-by-step solutions that decompose the problem-solving process into intermediate reasoning steps. Solutions are provided in natural language format with mathematical notation, enabling evaluation of not just final answers but also intermediate reasoning quality. This supports training and evaluation of chain-of-thought reasoning models, where the ability to generate correct intermediate steps is as important as reaching the correct final answer. Solutions are verified by domain experts during curation, ensuring correctness.","intents":["Train chain-of-thought reasoning models by providing ground-truth step-by-step solutions as training targets","Evaluate whether a model can generate correct intermediate reasoning steps, not just final answers","Analyze model reasoning quality by comparing generated steps against reference solutions","Support few-shot prompting strategies that use reference solutions as examples of correct reasoning"],"best_for":["Researchers training and evaluating chain-of-thought and step-by-step reasoning models","Teams building math tutoring systems that need to explain solution steps to students","Organizations studying intermediate reasoning quality and error propagation in multi-step problems"],"limitations":["Solutions are provided in natural language format, not machine-parseable structured representations — requires custom parsing to extract individual steps","Solution granularity is variable (some problems have 3 steps, others have 10+) — no standardized step format for comparison","Solutions are written by humans and may not match the exact reasoning path a model generates, even if both are correct","No automated step-level correctness evaluation — requires custom metrics or manual review to assess intermediate step quality"],"requires":["Hugging Face Datasets library to access solution text","Python 3.7+ for solution parsing and processing","Optional: custom parsing logic to extract individual steps from natural language solutions","Optional: LLM or similarity metric to compare generated steps against reference solutions"],"input_types":["problem statement (natural language text)","optional: model-generated solution steps (natural language text)"],"output_types":["reference solution text (natural language with mathematical notation)","step-level evaluation metrics (step correctness, step similarity to reference)","solution quality assessment (complete vs partial solutions, reasoning validity)"],"categories":["data-processing-analysis","testing-quality"],"confidence":0.5,"matches":0,"success_rate":0},{"id":"math__cap_4","uri":"capability://data.processing.analysis.benchmark.performance.tracking.and.historical.comparison","name":"benchmark performance tracking and historical comparison","description":"Provides a stable, unchanging evaluation set that enables longitudinal tracking of model performance improvements over time. The dataset's fixed composition (12,500 problems) and expert-curated solutions allow researchers to compare results across different model versions, architectures, and training approaches using identical evaluation conditions. Historical performance data (e.g., GPT-3 at 6.9%, o3 and DeepSeek R1 at 90%+) is tracked and published, enabling researchers to contextualize new model performance against established baselines.","intents":["Track performance improvements of a model across training iterations or versions using a fixed benchmark","Compare performance of different model architectures on identical problems to isolate architectural differences","Establish baseline performance for new models and contextualize results against historical data","Measure progress toward human-level or superhuman performance on competition mathematics"],"best_for":["AI researchers and organizations tracking model improvement over time","Teams comparing their models against published baselines and historical performance","Organizations publishing model results and wanting standardized comparison points"],"limitations":["Dataset is static — cannot capture performance on new competition problems released after curation cutoff","Performance saturation risk — as models improve, the dataset may become too easy to differentiate between top models (ceiling effect)","Historical performance data is sparse — only a few major model checkpoints are published, making trend analysis difficult","Benchmark may become outdated as models improve — current top models (o3, DeepSeek R1) achieve 90%+ accuracy, limiting future differentiation"],"requires":["Hugging Face Datasets library to load the dataset","Python 3.7+ for evaluation and metric computation","LLM or reasoning model to generate predictions on problems","Optional: published historical performance data for comparison"],"input_types":["model predictions on all 12,500 problems (final answers or full solutions)","optional: historical performance data for comparison (JSON or CSV with model names and accuracy scores)"],"output_types":["accuracy metrics (overall accuracy, per-subject accuracy, per-difficulty accuracy)","performance comparison tables (model vs model, version vs version)","performance trend analysis (accuracy improvement over time)"],"categories":["data-processing-analysis","testing-quality"],"confidence":0.5,"matches":0,"success_rate":0},{"id":"math__cap_5","uri":"capability://data.processing.analysis.multi.subject.balanced.evaluation.set.construction","name":"multi-subject balanced evaluation set construction","description":"Enables construction of evaluation sets with balanced representation across the 7 mathematical subjects, ensuring that benchmark results are not skewed by subject-specific performance variations. Researchers can programmatically sample equal numbers of problems from each subject (e.g., 100 problems per subject for a 700-problem evaluation set) or weight sampling by subject difficulty distribution. This supports fair, representative evaluation that reflects overall mathematical reasoning capability rather than performance on a single domain.","intents":["Create balanced evaluation sets that fairly represent all mathematical domains without subject-specific bias","Ensure benchmark results reflect overall mathematical reasoning capability, not just performance on overrepresented subjects","Compare model performance across subjects using equal-sized subsets to isolate domain-specific strengths and weaknesses","Design evaluation protocols that weight subjects by difficulty or importance for specific applications"],"best_for":["Researchers wanting fair, representative benchmarking across mathematical domains","Teams designing evaluation protocols for math-solving models","Organizations publishing benchmark results and wanting to avoid subject-specific bias"],"limitations":["Balanced sampling may not reflect real-world problem distributions (e.g., in practice, algebra problems may be more common than number theory)","Subject distribution in the original dataset may be uneven, limiting the size of balanced subsets (smallest subject determines maximum subset size)","Balancing by difficulty adds complexity — requires careful weighting to avoid over-representing easy or hard problems"],"requires":["Hugging Face Datasets library with sampling and filtering functionality","Python 3.7+ for dataset manipulation","Knowledge of desired subset size and subject weighting strategy"],"input_types":["desired subset size (integer, total problems to sample)","optional: subject weights (dictionary mapping subject names to weights)","optional: difficulty constraints (range 1-5 to limit sampling to specific difficulty levels)"],"output_types":["balanced evaluation dataset (problems sampled equally or weighted across subjects)","subject distribution statistics (count and percentage of problems per subject in final set)","sampling metadata (random seed for reproducibility)"],"categories":["data-processing-analysis","testing-quality"],"confidence":0.5,"matches":0,"success_rate":0},{"id":"math__headline","uri":"capability://testing.quality.benchmark.dataset.for.mathematical.reasoning","name":"benchmark dataset for mathematical reasoning","description":"A comprehensive benchmark dataset containing 12,500 competition-level mathematics problems designed to test and evaluate genuine mathematical reasoning across various subjects and difficulty levels.","intents":["best math reasoning benchmark","math dataset for AI training","competition math problems dataset","benchmark for mathematical problem-solving","dataset for evaluating math capabilities"],"best_for":["AI model training","educational assessments"],"limitations":["focused on competition-level problems"],"requires":[],"input_types":[],"output_types":[],"categories":["testing-quality"],"confidence":0.5,"matches":0,"success_rate":0}],"trust":{"score":56,"verified":false,"data_access_risk":"low","permissions":["Hugging Face Datasets library (datasets>=2.0.0) or direct download access","Python 3.7+ for dataset loading and processing","Sufficient disk space (~500MB-1GB for full dataset with solutions)","LLM or reasoning model capable of generating multi-token mathematical expressions and symbolic reasoning","Hugging Face Datasets library with filter/select functionality","Python 3.7+ for dataset manipulation","Knowledge of difficulty scale (1-5) and subject categories to construct meaningful subsets","Hugging Face Datasets library with filter/group_by functionality","Familiarity with the 7 subject categories to construct meaningful queries","Hugging Face Datasets library to access solution text"],"failure_modes":["Dataset is static and finite (12,500 problems) — does not grow with new competition years after curation cutoff","Problems require symbolic/algebraic reasoning; limited coverage of applied mathematics or real-world problem contexts","Difficulty ratings are subjective and may not correlate uniformly with model performance across different architectures","Solutions are provided in natural language format, not machine-parseable structured representations, requiring custom parsing for automated evaluation","No built-in support for partial credit or intermediate step validation — evaluation is typically binary (correct final answer or not)","Difficulty ratings are subjective and assigned during curation — may not align with actual model-specific difficulty (e.g., a model trained on geometry may find geometry problems easier than assigned)","No dynamic difficulty adjustment based on model performance — filtering is static based on pre-assigned labels","Difficulty distribution across subjects may be uneven (e.g., more hard geometry problems than hard prealgebra problems)","Problems are assigned to a single primary subject — does not capture multi-domain problems that require knowledge from multiple subjects","Subject taxonomy is fixed at 7 categories — no hierarchical organization (e.g., Algebra is not subdivided into Linear Algebra, Polynomial Algebra)","builder identity is not verified yet","no observed match outcomes yet"],"rank_breakdown":{"adoption":0.7,"quality":0.8500000000000001,"ecosystem":0.3,"match_graph":0.25,"freshness":0.75,"weights":{"adoption":0.3,"quality":0.25,"ecosystem":0.1,"match_graph":0.3,"freshness":0.05}},"observed_outcomes":{"matches":0,"success_rate":0,"avg_confidence":0,"top_intents":[],"last_matched_at":null},"maintenance":{"status":"active","updated_at":"2026-05-24T12:16:23.328Z","last_scraped_at":null,"last_commit":null},"community":{"stars":null,"forks":null,"weekly_downloads":null,"model_downloads":null,"model_likes":null}},"distribution":{"claim_url":"https://unfragile.ai/submit?claim=math","compare_url":"https://unfragile.ai/compare?artifact=math"}},"signature":"B8KZCEjlEpBRogKsfd2rID3OnOvhImVkenRINnfdI4UY6lEg5Tgm30sUAI2yjrjiVWMFeXwodk0OrsB4a0hnAQ==","signedAt":"2026-06-21T08:46:02.264Z","signedBy":"unfragile.ai","version":1},"_links":{"self":"https://unfragile.ai/api/v1/passport/math","artifact":"https://unfragile.ai/math","verify":"https://unfragile.ai/api/v1/verify?slug=math","publicKey":"https://unfragile.ai/api/v1/trust-passport-public-key","spec":"https://unfragile.ai/trust","schema":"https://unfragile.ai/schema.json","docs":"https://unfragile.ai/docs"}}