competition-mathematics problem corpus construction and curation
Aggregates 12,500 hand-curated competition mathematics problems sourced from the AMC (American Mathematics Competitions), the AIME (American Invitational Mathematics Examination), and other high-level competitions. Each problem carries metadata: a difficulty rating on a 1-5 scale, a subject classification across 7 domains, and a complete step-by-step solution (a schema sketch follows this entry). Curation filters for problems that require genuine mathematical reasoning rather than pattern matching, enabling reliable evaluation of a model's reasoning depth.
Unique: Curated from actual mathematics competitions (AMC/AIME) rather than synthetic or textbook problems, ensuring problems require genuine multi-step reasoning and cannot be solved by pattern matching alone. Includes difficulty stratification (1-5) and subject taxonomy across 7 mathematical domains, enabling fine-grained capability analysis. Verified solutions provided by domain experts, not generated by models.
vs alternatives: More rigorous than general math benchmarks (e.g., SVAMP, MathQA) because it uses authentic competition problems with higher reasoning complexity; more comprehensive than single-domain datasets because it spans 7 mathematical subjects with 12,500 problems; more reliable than synthetic benchmarks because problems are human-authored and competition-tested.
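A minimal record-schema sketch, assuming hypothetical field names (problem, solution, level, subject) rather than the dataset's documented layout:

```python
from dataclasses import dataclass

@dataclass
class MathProblem:
    """One curated problem with its metadata.

    Field names are illustrative assumptions, not the dataset's
    documented schema.
    """
    problem: str   # problem statement in natural language + math notation
    solution: str  # expert-verified step-by-step solution
    level: int     # pre-assigned difficulty on the 1-5 scale
    subject: str   # one of the 7 subject domains

example = MathProblem(
    problem="How many positive divisors does 36 have?",
    solution="36 = 2^2 * 3^2, so the count is (2+1)(2+1) = 9.",
    level=1,
    subject="Number Theory",
)
```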
difficulty-stratified problem sampling and filtering
Enables selective sampling of problems across a 5-level difficulty scale, so researchers can construct evaluation sets tailored to specific model capability ranges. Because difficulty metadata is pre-assigned during curation, filtering requires no re-evaluation (see the sketch after this entry). This supports progressive evaluation strategies in which models are first tested on easier problems (difficulty 1-2) before advancing to harder ones (difficulty 4-5), avoiding wasted compute on problems beyond a model's current capability.
Unique: Pre-assigned difficulty metadata (1-5 scale) from competition context enables efficient filtering without re-evaluation, unlike datasets where difficulty must be computed post-hoc. Difficulty labels are grounded in actual competition difficulty (AMC problems are easier, AIME problems are harder), providing meaningful stratification.
vs alternatives: More efficient than datasets requiring dynamic difficulty estimation, because each problem's difficulty is a constant-time metadata lookup rather than the output of repeated model runs; more reliable than model-specific difficulty metrics, because competition-grounded labels generalize across model architectures.
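A minimal filtering sketch, assuming records are plain dictionaries with a hypothetical level key holding the pre-assigned difficulty:

```python
def by_difficulty(problems, low, high):
    """Keep problems whose pre-assigned difficulty lies in [low, high]."""
    return [p for p in problems if low <= p["level"] <= high]

corpus = [{"level": 1}, {"level": 3}, {"level": 5}]  # toy stand-in records
easy = by_difficulty(corpus, 1, 2)  # first pass: difficulty 1-2
hard = by_difficulty(corpus, 4, 5)  # advance here only if the easy pass succeeds
```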
subject-domain problem categorization and retrieval
Organizes the 12,500 problems into 7 distinct mathematical subject categories (Prealgebra, Algebra, Number Theory, Counting and Probability, Geometry, Intermediate Algebra, Precalculus), enabling domain-specific evaluation and analysis. Each problem is tagged with its primary subject during curation, allowing researchers to isolate performance on specific mathematical domains and identify capability gaps (e.g., a model may excel at algebra but struggle with geometry). Supports both filtering and aggregation queries across subject boundaries (see the aggregation sketch after this entry).
Unique: Problems are curated and tagged with subject metadata from their original competition context, ensuring accurate domain classification. The 7-subject taxonomy reflects the structure of actual mathematics competitions, making it meaningful for evaluating mathematical reasoning across recognized disciplines.
vs alternatives: More granular than generic math benchmarks that treat all math problems uniformly; more reliable than automatic subject classification because tags are assigned by domain experts during curation, not inferred post-hoc; enables domain-specific analysis that generic benchmarks cannot support.
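A minimal per-subject aggregation sketch; the (subject, is_correct) pair format is an illustrative assumption, not a dataset API:

```python
from collections import defaultdict

def accuracy_by_subject(results):
    """Map per-problem grading results to per-subject accuracy,
    exposing gaps such as strong algebra but weak geometry.

    results: iterable of (subject, is_correct) pairs.
    """
    tally = defaultdict(lambda: [0, 0])  # subject -> [correct, attempted]
    for subject, correct in results:
        tally[subject][0] += int(correct)
        tally[subject][1] += 1
    return {s: c / n for s, (c, n) in tally.items()}

print(accuracy_by_subject([("Algebra", True), ("Geometry", False)]))
# {'Algebra': 1.0, 'Geometry': 0.0}
```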
step-by-step solution annotation and verification
Each of the 12,500 problems includes a detailed step-by-step solution that decomposes the problem-solving process into intermediate reasoning steps. Solutions are written in natural language with mathematical notation, enabling evaluation not only of final answers but also of intermediate reasoning quality (an answer-extraction sketch follows this entry). This supports training and evaluating chain-of-thought reasoning models, for which generating correct intermediate steps matters as much as reaching the correct final answer. Solutions are verified by domain experts during curation, ensuring correctness.
Unique: Solutions are expert-verified and provided as part of the dataset curation, not generated post-hoc by models. This ensures high-quality ground truth for training and evaluation. Solutions include intermediate reasoning steps in natural language, enabling evaluation of reasoning quality beyond final answer correctness.
vs alternatives: More valuable than datasets with only final answers because it enables chain-of-thought training and intermediate step evaluation; more reliable than model-generated solutions because they are human-authored and verified; more detailed than simple answer keys because it includes full reasoning paths.
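A minimal answer-extraction sketch, assuming (as is common in competition solution sets, though not confirmed here) that each solution marks its final answer with a \boxed{...} span:

```python
import re

def extract_boxed(solution):
    """Return the last \\boxed{...} span in a solution string, if any.

    Assumes the boxed-answer convention and handles only un-nested braces.
    """
    matches = re.findall(r"\\boxed\{([^{}]*)\}", solution)
    return matches[-1] if matches else None

print(extract_boxed(r"36 = 2^2 \cdot 3^2, so the answer is \boxed{9}."))  # 9
```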
benchmark performance tracking and historical comparison
Provides a stable, unchanging evaluation set that enables longitudinal tracking of model performance over time. The dataset's fixed composition (12,500 problems) and expert-curated solutions let researchers compare results across model versions, architectures, and training approaches under identical evaluation conditions (a scoring sketch follows this entry). Historical performance data (e.g., GPT-3 at 6.9%; o3 and DeepSeek R1 at 90%+) is tracked and published, letting researchers contextualize new model performance against established baselines.
Unique: Fixed, expert-curated dataset enables stable longitudinal benchmarking without dataset drift or contamination. Published historical performance data (GPT-3 6.9% → o3/DeepSeek R1 90%+) provides context for new results. Difficulty stratification and subject taxonomy enable fine-grained performance analysis beyond single accuracy scores.
vs alternatives: More stable than dynamic benchmarks that change over time because the problem set is frozen; more reliable than leaderboards without published solutions because results can be independently verified; more informative than single-point benchmarks because historical data enables trend analysis and contextualization.
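A minimal scoring sketch against the published baselines above; exact-match grading of extracted answers is an assumption about the evaluation protocol, not a documented rule:

```python
def exact_match_accuracy(predictions, answers):
    """Fraction of predictions that exactly match the reference answers."""
    return sum(p == a for p, a in zip(predictions, answers)) / len(answers)

baselines = {"GPT-3": 0.069}  # o3 and DeepSeek R1 reported at 0.90+
score = exact_match_accuracy(["9", "12"], ["9", "10"])  # toy data -> 0.5
print(score, score > baselines["GPT-3"])
```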
multi-subject balanced evaluation set construction
Enables construction of evaluation sets with balanced representation across the 7 mathematical subjects, ensuring benchmark results are not skewed by subject-specific performance variation. Researchers can programmatically sample equal numbers of problems from each subject (e.g., 100 per subject for a 700-problem evaluation set) or weight sampling by subject difficulty distribution (a seeded-sampling sketch follows this entry). This supports fair, representative evaluation that reflects overall mathematical reasoning capability rather than performance on a single domain.
Unique: Subject metadata enables programmatic construction of balanced evaluation sets without manual curation. The 7-subject taxonomy provides a natural framework for balancing, unlike datasets with coarse or overlapping categories.
vs alternatives: More flexible than fixed evaluation sets because it supports custom weighting and sampling; fairer than unbalanced datasets because it ensures equal representation across domains; more reproducible than manual curation because sampling is deterministic and can be seeded.
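A minimal seeded stratified-sampling sketch, again assuming dictionary records with a hypothetical subject key:

```python
import random

def balanced_sample(problems, per_subject, seed=0):
    """Draw per_subject problems from each subject, reproducibly.

    Assumes records carry a "subject" key; random.Random.sample raises
    ValueError if a subject pool is smaller than per_subject.
    """
    rng = random.Random(seed)
    pools = {}
    for p in problems:
        pools.setdefault(p["subject"], []).append(p)
    sample = []
    for subject in sorted(pools):  # sorted iteration keeps output deterministic
        sample.extend(rng.sample(pools[subject], per_subject))
    return sample

# e.g. balanced_sample(corpus, per_subject=100) yields 700 problems
# when all 7 subjects are present.
```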