multi-step mathematical reasoning benchmark evaluation
Evaluates language models' ability to perform 2-8 step mathematical reasoning on grade school word problems, using a curated dataset of 8.5K problems split into 7.5K training and 1K test examples. The evaluation framework extracts final answers marked with the #### delimiter and compares them against ground truth, enabling precise, automated measurement of multi-step reasoning accuracy across model architectures and sizes (a minimal evaluation-loop sketch follows this entry).
Unique: Uses linguistically diverse, human-authored grade school problems (not synthetic) that require genuine multi-step reasoning with basic arithmetic, combined with a standardized answer extraction format (#### delimiter) that enables reproducible evaluation across heterogeneous model outputs
vs alternatives: More challenging than simple arithmetic benchmarks (requires 2-8 reasoning steps) yet more accessible than advanced math benchmarks, making it ideal for measuring practical reasoning improvements in production models
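A minimal sketch of that evaluation loop, assuming the standard question/answer fields in test.jsonl; generate here is a hypothetical stand-in for whatever model is under evaluation:

```python
import json
import re

# Final answers are marked "#### <value>" in both the dataset and model output.
ANSWER_RE = re.compile(r"####\s*(-?[\d,\.]+)")

def extract_answer(text):
    """Return the ####-delimited answer as a normalized string, or None."""
    match = ANSWER_RE.search(text)
    if match is None:
        return None
    return match.group(1).replace(",", "").rstrip(".")

def evaluate(generate, test_path="test.jsonl"):
    """Score any callable (question -> solution text) on the test split."""
    correct = total = 0
    with open(test_path) as f:
        for line in f:
            example = json.loads(line)
            predicted = extract_answer(generate(example["question"]))
            gold = extract_answer(example["answer"])
            correct += predicted is not None and predicted == gold
            total += 1
    return correct / total
```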
calculator-integrated solution generation with annotation-based computation
Enables language models to generate arithmetically correct solutions by embedding calculation annotations of the form <<expression=result>> within generated text. During training, models learn these annotations as ordinary tokens; during inference, a calculator system detects expressions between the << and >> delimiters, evaluates them exactly, and overrides the model's own result with the computed one, preventing arithmetic errors from compounding across multi-step chains (a simplified sketch follows this entry).
Unique: Dual-mode annotation system where the same <<expression=result>> format serves as training signal (models learn to produce it) and inference hook (calculator detects and evaluates it), creating a learnable interface between language generation and deterministic computation without requiring separate tool-calling infrastructure
vs alternatives: Simpler than external tool-calling APIs (no function registry or schema negotiation needed) and more interpretable than black-box arithmetic, but less flexible than full function-calling systems for complex operations
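The calculator hook is easy to approximate after the fact: scan the generated text for annotations, re-evaluate each expression, and overwrite the model's claimed result. (The original system intervenes token-by-token during sampling, firing when the model emits the = inside an annotation; the regex pass below is a simplified illustration, and safe_eval is a restricted evaluator written for this sketch, not an official API.)

```python
import ast
import operator
import re

# Arithmetic annotations look like <<48/2=24>> inside generated solutions.
CALC_RE = re.compile(r"<<([^<>=]+)=([^<>]*)>>")

# Whitelisted operators, so we never eval() arbitrary model output.
OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
       ast.Mult: operator.mul, ast.Div: operator.truediv,
       ast.USub: operator.neg}

def safe_eval(expr):
    """Evaluate a basic arithmetic expression via the AST, not eval()."""
    def walk(node):
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in OPS:
            return OPS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in OPS:
            return OPS[type(node.op)](walk(node.operand))
        raise ValueError(f"unsupported expression: {expr}")
    return walk(ast.parse(expr.strip(), mode="eval").body)

def fix_calculations(text):
    """Recompute the right-hand side of every <<expr=result>> annotation."""
    def repl(match):
        value = safe_eval(match.group(1))
        value = int(value) if float(value).is_integer() else value
        return f"<<{match.group(1)}={value}>>"
    return CALC_RE.sub(repl, text)
```

For example, fix_calculations("Natalia sold 48/2 = <<48/2=25>>25 clips") corrects the annotation to <<48/2=24>> but leaves the stray 25 outside the delimiters untouched, which is exactly where the token-level version of the hook earns its keep.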
socratic-format guided reasoning dataset with subquestion decomposition
Provides an alternative dataset format (train_socratic.jsonl, test_socratic.jsonl) in which each problem is augmented with intermediate Socratic subquestions that guide step-by-step reasoning. This format makes it possible to train models to decompose problems into smaller reasoning steps before solving them, improving interpretability and potentially reducing errors in multi-step chains by making intermediate reasoning explicit (a parsing sketch follows this entry).
Unique: Augments standard problems with human-authored Socratic subquestions that decompose reasoning into explicit intermediate steps, creating a structured reasoning scaffold that models can learn from without requiring external prompting or chain-of-thought engineering
vs alternatives: More structured than zero-shot chain-of-thought prompting (reasoning steps are baked into training data) but less flexible than dynamic prompting systems that generate subquestions at inference time
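A parsing sketch, assuming the standard question/answer schema; the ' ** ' separator between subquestion and step matches the released socratic files as far as we can tell, but treat it as an assumption:

```python
import json

def split_socratic_steps(answer_text):
    """Split a socratic-format answer into (subquestion, step) pairs
    plus the final ####-delimited answer."""
    steps, final = [], None
    for line in answer_text.strip().splitlines():
        if line.startswith("####"):
            final = line[4:].strip()
        elif " ** " in line:  # assumed subquestion/step separator
            subquestion, step = line.split(" ** ", 1)
            steps.append((subquestion.strip(), step.strip()))
    return steps, final

with open("train_socratic.jsonl") as f:
    record = json.loads(next(f))
steps, final = split_socratic_steps(record["answer"])
for subquestion, step in steps:
    print(f"Q: {subquestion}\nA: {step}")
print("final answer:", final)
```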
standardized answer extraction and correctness comparison
Implements a deterministic answer extraction pipeline that parses generated solutions to locate the final answer marked with the #### delimiter, extracts the numeric value, and compares it against the ground truth answer from the dataset. This enables automated correctness evaluation without manual inspection, supporting batch evaluation across thousands of model outputs with consistent, reproducible metrics (a comparison sketch follows this entry).
Unique: Uses a simple, language-agnostic delimiter (####) for answer marking that works regardless of how a model formats the rest of its output, combined with numeric comparison logic that normalizes formatting (e.g., thousands separators) and treats integer and floating-point renderings of the same value as equal, enabling consistent evaluation without model-specific parsing
vs alternatives: More robust than free-form regex extraction from unstructured solutions (the explicit delimiter is unambiguous) and more scalable than manual evaluation, but less sophisticated than semantic similarity metrics that could credit partially correct reasoning
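A hedged sketch of the comparison logic; the tolerance-based check is one reasonable way to realize the integer/float equivalence described above, not necessarily how any given implementation does it:

```python
import math
import re

ANSWER_RE = re.compile(r"####\s*(-?[\d,\.]+)")

def extract_number(text):
    """Pull the ####-delimited answer and parse it as a number, else None."""
    match = ANSWER_RE.search(text)
    if match is None:
        return None
    raw = match.group(1).replace(",", "").rstrip(".")  # "1,234." -> "1234"
    try:
        return float(raw)
    except ValueError:
        return None

def answers_match(predicted_text, gold_text, rel_tol=1e-4):
    """Numeric comparison: '72', '72.0', and '72.00' all count as equal."""
    pred, gold = extract_number(predicted_text), extract_number(gold_text)
    return (pred is not None and gold is not None
            and math.isclose(pred, gold, rel_tol=rel_tol))
```

Under this scheme answers_match("... #### 72", "... #### 72.0") is True, where a raw string comparison would reject it.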
linguistically diverse problem corpus with controlled reasoning complexity
Curates 8.5K human-authored grade school math word problems with explicit control over reasoning complexity (2-8 steps per problem) and high linguistic diversity to prevent models from exploiting surface-level patterns. The dataset balances problem difficulty, operation types, and linguistic variation to create a robust benchmark that measures genuine mathematical reasoning rather than pattern matching or memorization.
Unique: Human authorship, explicit step-count constraints (2-8 steps), and linguistic diversity make it hard for models to solve problems through surface-level pattern matching or memorization, so the benchmark measures genuine multi-step reasoning capability
vs alternatives: More challenging than synthetic or template-based benchmarks (human authorship prevents exploitable patterns) and more stable than crowdsourced datasets (controlled authorship ensures consistency), but smaller than web-scraped math problem collections
example model solutions with multi-size performance reference
Provides pre-generated solutions from models of varying sizes (in example_model_solutions.jsonl) that serve as reference outputs and performance baselines. These solutions demonstrate how different model scales approach the same problems, enabling researchers to study scaling laws in mathematical reasoning and to validate evaluation infrastructure against known model outputs (an inspection snippet follows this entry).
Unique: Pre-computed solutions from multiple model sizes in a single standardized file enable direct comparison of how model scale affects reasoning quality without requiring researchers to re-run inference on large models, reducing computational overhead for benchmarking studies
vs alternatives: More convenient than running inference on reference models yourself (no compute cost) but less flexible than dynamic baselines that could be updated as new models emerge
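Because the per-model field layout inside example_model_solutions.jsonl is specific to that file, a safe first step is to inspect a record before writing any comparison code; the snippet below assumes nothing beyond the JSON Lines container:

```python
import json

# Peek at the first record to discover its fields (e.g., which keys hold
# which model's solutions) before building analysis code on top of them.
with open("example_model_solutions.jsonl") as f:
    first = json.loads(next(f))

for key, value in first.items():
    preview = str(value).replace("\n", " ")[:80]
    print(f"{key}: {preview}")
```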
json lines format dataset serialization with streaming support
Stores all problems and solutions in JSON Lines format (.jsonl), where each line is a complete, self-contained JSON object representing one problem-solution pair. This format supports streaming reads of large datasets without loading entire files into memory, line-by-line processing in data pipelines, and easy integration with distributed training frameworks that process data in batches (a streaming reader is sketched after this entry).
Unique: Uses line-delimited JSON format that enables streaming processing without loading entire dataset into memory, combined with self-contained problem-solution pairs that allow independent processing of each example in distributed training pipelines
vs alternatives: More memory-efficient than monolithic JSON files and more human-readable than binary formats, but slower for random access than indexed databases or columnar formats like Parquet
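A streaming reader is only a few lines and keeps memory usage flat regardless of split size (train.jsonl is the dataset's standard training file):

```python
import json

def stream_jsonl(path):
    """Yield one parsed record at a time; the file is never fully loaded."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # tolerate trailing blank lines
                yield json.loads(line)

# Each record is self-contained, so examples can be processed (or sharded
# across workers) independently.
for i, example in enumerate(stream_jsonl("train.jsonl")):
    if i >= 2:
        break
    print(example["question"][:60], "...")
```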
training and inference pipeline integration with model sampling
Provides infrastructure for training models on GSM8K data and generating solutions through sampling-based inference. The pipeline handles data loading, model fine-tuning, solution generation with temperature/sampling parameters, and integration with the calculator system to keep arithmetic correct. This enables end-to-end workflows from raw dataset to evaluated model performance without external tooling (an end-to-end sketch follows this entry).
Unique: Integrates dataset loading, model training, solution generation, calculator evaluation, and answer extraction into a single end-to-end pipeline, with sampling-based inference that allows temperature control for exploring solution diversity while maintaining arithmetic correctness through calculator integration
vs alternatives: More complete than standalone dataset (includes training and inference code) but less flexible than modular frameworks that allow swapping components; tightly integrated for GSM8K but requires customization for other tasks
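Putting the pieces together, the end-to-end flow looks roughly like the sketch below. finetune and sample are hypothetical stand-ins for the actual training and sampling code, and fix_calculations / extract_answer refer to the sketches earlier in this section:

```python
import json

def run_pipeline(model, train_path="train.jsonl", test_path="test.jsonl",
                 temperature=0.7):
    """End-to-end sketch: fine-tune, sample with temperature, score.

    `model` is any object exposing hypothetical finetune()/sample() methods.
    """
    with open(train_path) as f:
        train = [json.loads(line) for line in f]
    model.finetune(train)  # hypothetical: learn to emit <<expr=result>> and ####

    correct = total = 0
    with open(test_path) as f:
        for line in f:
            example = json.loads(line)
            # hypothetical sampling call; temperature trades diversity for greediness
            solution = model.sample(example["question"], temperature=temperature)
            solution = fix_calculations(solution)  # calculator pass (sketched above)
            predicted = extract_answer(solution)   # #### extraction (sketched above)
            gold = extract_answer(example["answer"])
            correct += predicted is not None and predicted == gold
            total += 1
    return correct / total
```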