Arithmetic And Mathematical Reasoning Evaluation

1

ZeroEvalBenchmark63/100

via “zero-shot mathematical reasoning evaluation”

Zero-shot LLM evaluation for reasoning tasks.

Unique: Implements unified zero-shot evaluation specifically designed to isolate reasoning capability from few-shot learning effects, with multi-format answer extraction that handles LaTeX, symbolic, and natural language mathematical expressions without requiring model-specific output formatting

vs others: Differs from general LLM benchmarks (MMLU, GSM8K) by explicitly removing few-shot examples and standardizing evaluation across mathematical domains, providing cleaner signal for foundational reasoning ability

2

MATH BenchmarkBenchmark63/100

via “solution step extraction and intermediate reasoning evaluation”

12.5K competition math problems — AMC/AIME/Olympiad level, 7 subjects, standard math benchmark.

Unique: Preserves solution steps as first-class data throughout the evaluation pipeline, enabling evaluation of intermediate reasoning quality rather than just final answers. This supports emerging research on chain-of-thought prompting and interpretable AI reasoning.

vs others: More comprehensive than final-answer-only evaluation because it assesses reasoning quality and interpretability, but requires more manual annotation and is harder to automate than simple answer verification.

3

MathVistaBenchmark62/100

via “multimodal mathematical reasoning evaluation across visual domains”

Visual mathematical reasoning benchmark.

Unique: Combines visual understanding with mathematical problem-solving across three newly created datasets (IQTest, FunctionQA, PaperQA) plus 28 existing multimodal datasets, totaling 6,141 examples with explicit focus on compositional reasoning where visual perception and mathematical logic must be jointly applied. Unlike single-domain benchmarks, MathVista spans geometry, statistics, and scientific figures, exposing differential model performance across mathematical reasoning types.

vs others: Broader than domain-specific benchmarks (e.g., geometry-only or chart-only) and more rigorous than general vision-language benchmarks because it requires both accurate visual interpretation AND correct mathematical reasoning, not just image captioning or visual QA on non-mathematical content.

4

FrontierMathBenchmark61/100

via “cross-subdiscipline mathematical reasoning measurement”

Expert-level math problems created by mathematicians.

Unique: Explicitly structures evaluation across four mathematical subdisciplines (number theory, algebra, geometry, analysis) to measure generalization and identify domain-specific reasoning patterns, rather than treating mathematics as a monolithic domain

vs others: Provides subdiscipline-specific performance insights that reveal whether AI reasoning is broadly generalizable or domain-dependent, whereas most benchmarks report aggregate mathematical performance

5

BIG-Bench Hard (BBH)Dataset59/100

23 hardest BIG-Bench tasks where models initially failed.

Unique: Focuses specifically on multi-step arithmetic and mathematical reasoning through few-shot examples, isolating numerical reasoning capability from general language understanding. Tasks test both calculation accuracy and mathematical inference patterns.

vs others: More focused on mathematical reasoning than general reasoning benchmarks; more accessible than formal mathematics verification because it uses natural language problem statements rather than symbolic notation.

6

Pixtral LargeModel58/100

via “mathematical reasoning over visual data”

Mistral's 124B multimodal model with vision capabilities.

Unique: Achieves 69.4% on MathVista benchmark (outperforming all tested models) through integrated visual parsing and mathematical reasoning in a single 124B model, without requiring separate symbolic math engines or specialized mathematical libraries

vs others: Outperforms GPT-4o, Gemini-1.5 Pro, and Claude-3.5 Sonnet on MathVista while being available for self-hosted deployment, eliminating API dependency for educational or research mathematical analysis

7

Llama 3.1 405BModel57/100

via “mathematical reasoning with 96.8% gsm8k accuracy”

Largest open-weight model at 405B parameters.

Unique: 405B parameter scale enables 96.8% GSM8K performance through learned chain-of-thought patterns in transformer architecture, achieving near-human accuracy on grade-school math without external symbolic engines or calculators

vs others: Larger model scale than most open-source alternatives improves mathematical reasoning accuracy; however, lacks symbolic verification that specialized math engines provide, making it suitable for reasoning tasks but not formal proofs

8

Llama 3.3 70BModel57/100

via “mathematical reasoning with math benchmark performance”

Meta's 70B open model matching 405B-class performance.

Unique: Achieves strong mathematical reasoning performance at 70B parameters through instruction-tuning on mathematical problem-solving datasets, enabling competitive MATH benchmark performance without specialized symbolic reasoning modules

vs others: Provides mathematical reasoning capability comparable to larger closed-source models while remaining open-weight and self-hostable, though without formal verification guarantees of symbolic math systems

9

Yi-34BModel57/100

via “competitive mathematical reasoning with transformer-based arithmetic”

01.AI's bilingual 34B model with 200K context option.

Unique: Achieves competitive mathematical reasoning through general-purpose transformer pretraining without documented chain-of-thought training or specialized math fine-tuning, suggesting strong mathematical pattern learning from raw pretraining data. Supports both English and Chinese mathematical notation and problem-solving.

vs others: Delivers competitive math performance at 34B scale without specialized training overhead, reducing model size and inference cost while maintaining reasonable mathematical reasoning for educational and problem-solving applications.

10

DeepSeek Coder V2Model57/100

via “mathematical reasoning and step-by-step problem solving”

DeepSeek's 236B MoE model specialized for code.

Unique: Trained on 6 trillion tokens including mathematical reasoning datasets and code-based solutions, enabling both symbolic reasoning and code generation for mathematical problems in a single model without separate math-specific components

vs others: Provides integrated mathematical reasoning and code generation (unlike Copilot which focuses on code) while maintaining open-source weights and supporting local deployment

11

FinQADataset57/100

via “arithmetic operation type classification and execution”

8.3K financial reasoning questions over real S&P 500 earnings reports.

Unique: Embeds arithmetic operation selection within financial domain context, requiring models to understand that 'margin' semantically maps to division and 'total' maps to addition. This tests semantic grounding of operations, not just arithmetic execution.

vs others: More semantically grounded than generic math word problem datasets because operation selection is implicit in financial terminology, but less explicit than datasets with annotated operation types because operations must be inferred

12

DeepSeek V3Model57/100

via “mathematical reasoning and problem-solving”

671B MoE model matching GPT-4o at fraction of training cost.

Unique: Achieves 90.2% on MATH benchmark through MoE architecture that routes mathematical reasoning tokens through specialized expert parameters, enabling efficient scaling of reasoning capability without proportional increase in active parameters per token

vs others: Matches GPT-4o mathematical reasoning performance (90.2% MATH) while using 37B active parameters vs GPT-4o's undisclosed parameter count, reducing inference latency and cost for math-heavy workloads

13

GSM8KDataset56/100

via “multi-step mathematical reasoning benchmark evaluation”

8.5K grade school math problems — multi-step reasoning, verifiable solutions, reasoning benchmark.

Unique: Uses linguistically diverse, human-authored grade school problems (not synthetic) that require genuine multi-step reasoning with basic arithmetic, combined with a standardized answer extraction format (#### delimiter) that enables reproducible evaluation across heterogeneous model outputs

vs others: More challenging than simple arithmetic benchmarks (requires 2-8 reasoning steps) yet more accessible than advanced math benchmarks, making it ideal for measuring practical reasoning improvements in production models

14

MATHDataset56/100

via “competition-mathematics problem corpus construction and curation”

12.5K competition math problems across 7 subjects and 5 difficulty levels.

Unique: Curated from actual mathematics competitions (AMC/AIME) rather than synthetic or textbook problems, ensuring problems require genuine multi-step reasoning and cannot be solved by pattern matching alone. Includes difficulty stratification (1-5) and subject taxonomy across 7 mathematical domains, enabling fine-grained capability analysis. Verified solutions provided by domain experts, not generated by models.

vs others: More rigorous than general math benchmarks (e.g., SVAMP, MathQA) because it uses authentic competition problems with higher reasoning complexity; more comprehensive than single-domain datasets because it spans 7 mathematical subjects with 12,500 problems; more reliable than synthetic benchmarks because problems are human-authored and competition-tested.

15

o3-miniModel55/100

via “mathematical problem solving with symbolic reasoning”

Cost-efficient reasoning model with configurable effort levels.

Unique: Implements specialized mathematical reasoning patterns with step-by-step derivation generation, achieving competition-level math performance through domain-specific training rather than general reasoning

vs others: Matches o3 on mathematical benchmarks at lower cost; outperforms standard LLMs (GPT-4, Claude) on competition-level problems due to reasoning-grade capabilities

16

ARCBenchmark49/100

via “evaluation metric formulation”

Abstraction and reasoning corpus for general intelligence

Unique: The evaluation metrics are specifically tailored to assess abstract reasoning capabilities, unlike generic metrics that may not reflect reasoning depth.

vs others: Offers more nuanced evaluation than traditional benchmarks like accuracy, which may not fully capture reasoning abilities.

17

MATHDataset49/100

via “advanced mathematical problem evaluation”

Competition mathematics problems (harder than GSM8K)

Unique: MATH's dataset is specifically curated from high school math contests, providing a unique challenge that is more difficult than typical benchmarks, allowing for a clearer differentiation of model capabilities.

vs others: More challenging than GSM8K, making it a superior choice for evaluating advanced mathematical reasoning in AI models.

18

GSM8KDataset47/100

via “multi-step mathematical reasoning evaluation”

Grade school math problems requiring multi-step reasoning

Unique: GSM8K is specifically curated to include a diverse set of multi-step reasoning problems, making it more targeted than generic math datasets, allowing for precise evaluation of reasoning capabilities in LLMs.

vs others: More focused on multi-step reasoning than other benchmarks like MATH, which may include less structured problems.

19

chinese-llm-benchmarkBenchmark45/100

via “mathematical reasoning and logic problem evaluation with specialized scoring”

ReLE评测：中文AI大模型能力评测（持续更新）：目前已囊括374个大模型，覆盖chatgpt、gpt-5.4、谷歌gemini-3.1-pro、Claude-4.6、文心ERNIE-X1.1、ERNIE-5.0、qwen3.6-max、qwen3.6-plus、百川、讯飞星火、商汤senseChat等商用模型，以及step3.5-flash、kimi-k2.6、ernie4.5、MiniMax-M2.7、deepseek-v4、Qwen3.6、llama4、智谱GLM-5.1、MiMo-V2、LongCat、gemma4、mistral等开源大模型。不仅提供排行榜，也提供规模超200万的大

Unique: Evaluates mathematical reasoning with 1-5 quality scale for reasoning steps rather than binary correctness, enabling partial credit for correct methodology with computational errors. Combines final answer accuracy with reasoning quality assessment to capture mathematical thinking capability. Includes multi-step reasoning problems and logical inference tasks beyond simple arithmetic.

vs others: More nuanced mathematical assessment than MMLU (binary correctness) and captures reasoning quality vs answer-only evaluation

20

Mistral Large (123B)Model40/100

via “mathematical reasoning and symbolic computation”

Mistral Large — powerful reasoning and instruction-following

Top Matches

Also Known As

Company