Mathematical Reasoning With 96.8% GSM8K Accuracy

1

Llama 3.1 405B · Model · 57/100

via “mathematical reasoning with 96.8% gsm8k accuracy”

Largest open-weight model at 405B parameters.

Unique: The 405B parameter scale supports 96.8% GSM8K accuracy through chain-of-thought patterns learned by the transformer itself. The model reaches near-human accuracy on grade-school math without external symbolic engines or calculators.

vs others: Its larger scale improves mathematical reasoning accuracy over most open-source alternatives. However, it lacks the symbolic verification that specialized math engines provide, so it suits reasoning tasks but not formal proofs.
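To put the 96.8% figure in concrete terms, the short calculation below converts it into a count of solved problems. It assumes the score was measured on the standard GSM8K test split of 1,319 problems, which this listing does not state. The variable names are illustrative only.

```python
# Assumption: the accuracy was measured on the standard GSM8K test split.
TEST_SET_SIZE = 1319
reported_accuracy = 0.968  # figure quoted in the listing

# Number of problems that must be solved to round to the reported accuracy.
correct = round(reported_accuracy * TEST_SET_SIZE)
errors = TEST_SET_SIZE - correct

print(f"correct: {correct}, errors: {errors}")
print(f"implied accuracy: {correct / TEST_SET_SIZE:.1%}")
```

Under that assumption, the headline number corresponds to a few dozen missed problems, so a difference of one or two problems already moves the score by about 0.1 points.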

2

Mixtral 8x22B · Model · 57/100

via “mathematical-reasoning-with-instruction-tuning”

Mistral's sparse mixture-of-experts model with 141B total parameters (39B active per token).

Unique: Achieves 90.8% on GSM8K through instruction tuning that teaches explicit step-by-step mathematical reasoning, scored with majority voting over 8 samples (maj@8). This approach trades inference cost (8 samples per problem) for accuracy, making it suitable for applications where accuracy and reasoning transparency are valued over single-sample speed.

vs others: Strong grade-school math performance (90.8% GSM8K) comparable to GPT-3.5-turbo; weaker on competition-level math (44.6% MATH) than GPT-4 or specialized math models; open-source licensing enables fine-tuning for domain-specific math tasks.
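The majority voting over 8 samples mentioned for this entry can be sketched as follows. This is a minimal illustration, not Mistral's evaluation code: it assumes the final answer has already been extracted from each sampled reasoning chain as a string, and the function name `majority_vote` and the sample values are hypothetical.

```python
from collections import Counter
from typing import List, Optional


def majority_vote(answers: List[Optional[str]]) -> Optional[str]:
    """Return the most common non-empty answer.

    Ties are resolved in favor of the answer seen first, because
    Counter.most_common keeps first-encountered order for equal counts.
    """
    valid = [a.strip() for a in answers if a and a.strip()]
    if not valid:
        return None
    return Counter(valid).most_common(1)[0][0]


# Hypothetical final answers extracted from 8 sampled reasoning chains.
samples = ["18", "18", "20", "18", "17", "18", "20", "18"]
print(majority_vote(samples))
```

Each problem costs 8 generations instead of 1, which is the inference-cost trade-off described above. Occasional arithmetic slips in individual chains are outvoted as long as the correct answer remains the most frequent one.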

3

GSM8K · Dataset · 57/100

via “benchmark dataset for evaluating mathematical reasoning in language models”

8.5K grade school math problems — multi-step reasoning, verifiable solutions, reasoning benchmark.

Unique: GSM8K combines linguistically diverse problem statements with multi-step reasoning challenges tailored to language models.

vs others: GSM8K focuses specifically on multi-step arithmetic word problems that are challenging for models yet solvable by middle school students, which provides a clear benchmark for AI capabilities.
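The "verifiable solutions" property can be illustrated with a small scoring sketch. GSM8K reference solutions end with a line of the form `#### <number>`, so the reference answer can be read off directly. How the model's answer is extracted varies between evaluation harnesses. The last-number heuristic below is only one assumed convention, and the function names and example strings are hypothetical.

```python
import re
from typing import Optional


def reference_answer(solution: str) -> str:
    """Take the text after the final '####' marker, without thousands separators."""
    return solution.split("####")[-1].strip().replace(",", "")


def predicted_answer(completion: str) -> Optional[str]:
    """Heuristic (assumption): treat the last number in the output as the answer."""
    numbers = re.findall(r"-?\d[\d,]*(?:\.\d+)?", completion)
    return numbers[-1].replace(",", "") if numbers else None


def is_correct(completion: str, solution: str) -> bool:
    """Compare prediction and reference numerically, so that 36 == 36.0."""
    pred = predicted_answer(completion)
    if pred is None:
        return False
    return float(pred) == float(reference_answer(solution))


# Hypothetical problem in the GSM8K solution format.
solution = "Each of 3 boxes holds 12 pens, so there are 3 * 12 = 36 pens.\n#### 36"
completion = "3 boxes times 12 pens is 36 pens. The answer is 36."
print(is_correct(completion, solution))
```

Because every problem has a single numeric answer, accuracy is simply the fraction of problems for which a check like `is_correct` returns true.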

4

GSM8K · Dataset · 47/100

via “multi-step mathematical reasoning evaluation”

Grade school math problems requiring multi-step reasoning

Unique: GSM8K is curated to include a diverse set of multi-step reasoning problems. This makes it more targeted than generic math datasets and allows precise evaluation of reasoning capabilities in LLMs.

vs others: More narrowly focused on multi-step arithmetic reasoning than benchmarks like MATH, which covers harder competition-level problems across several subjects.
