MMLU
Benchmark · Free
57-subject knowledge benchmark — 15K+ questions across STEM, humanities, and professional domains.
Capabilities — 7 decomposed
few-shot multitask evaluation across 57 knowledge domains
Medium confidence — Executes standardized few-shot prompting evaluation on language models across 57 subjects (STEM, humanities, social sciences, professional) by constructing few-shot prompts with 5 example question-answer pairs per subject, then measuring accuracy on held-out test sets. The system uses a hierarchical subject organization (e.g., STEM → physics → high school physics) and aggregates results at subject, category, and overall levels to produce granular performance metrics.
Organizes 15,908 questions hierarchically across 57 subjects with standardized few-shot prompting (5 examples per subject) and aggregates results at multiple granularity levels (subject, category, overall), enabling both broad coverage assessment and fine-grained domain analysis in a single evaluation run
Broader coverage than domain-specific benchmarks (57 subjects vs 1-5) and more standardized than ad-hoc evaluation, making it the de facto general knowledge benchmark for LLM comparison in research and industry
prompt generation with few-shot example formatting
Medium confidence — Constructs few-shot prompts by formatting subject name, selecting 5 in-context examples from the training set, and appending the test question with multiple-choice options. The system implements format_subject() to normalize subject names, format_example() to structure each example as 'Question: ... Options: A) ... B) ... C) ... D) ... Answer: X', and gen_prompt() to concatenate examples with the target question. This approach ensures consistent prompt structure across all 57 subjects and enables reproducible few-shot evaluation.
Implements standardized prompt formatting functions (format_subject, format_example, gen_prompt) that ensure consistent structure across all 57 subjects, enabling reproducible few-shot evaluation and reducing prompt-induced variance in model performance measurement
More reproducible than manual prompt engineering and more standardized than ad-hoc formatting, ensuring that performance differences reflect model capability rather than prompt variation
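A minimal sketch of how these three helpers might fit together, assuming each subject is stored as a headerless DataFrame with columns [question, A, B, C, D, answer]; the column layout and the exact instruction wording are assumptions for illustration, not the repository's code.

```python
# Sketch of the prompt-construction step. Data layout and prompt wording are
# assumptions; only the function names come from the description above.
import pandas as pd

CHOICES = ["A", "B", "C", "D"]

def format_subject(subject: str) -> str:
    """Turn a file-style subject name like 'high_school_physics' into prose."""
    return subject.replace("_", " ")

def format_example(df: pd.DataFrame, idx: int, include_answer: bool = True) -> str:
    """Render one row as a question, its four options, and (optionally) the answer."""
    prompt = str(df.iloc[idx, 0])
    for j, letter in enumerate(CHOICES):
        prompt += f"\n{letter}. {df.iloc[idx, j + 1]}"
    prompt += "\nAnswer:"
    if include_answer:
        prompt += f" {df.iloc[idx, 5]}\n\n"
    return prompt

def gen_prompt(dev_df: pd.DataFrame, subject: str, k: int = 5) -> str:
    """Concatenate an instruction line with k in-context examples."""
    prompt = (f"The following are multiple choice questions (with answers) "
              f"about {format_subject(subject)}.\n\n")
    for i in range(min(k, len(dev_df))):
        prompt += format_example(dev_df, i)
    return prompt

# A full query prompt is then gen_prompt(dev_df, subject, 5)
# + format_example(test_df, i, include_answer=False).
```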
context-aware prompt truncation via BPE tokenization
Medium confidence — Truncates prompts to fit within model context windows using Byte Pair Encoding (BPE) tokenization. The crop.py system encodes prompts to BPE tokens, truncates to a maximum of 2048 tokens, and decodes back to text while preserving semantic coherence. This approach automatically downloads encoder resources (e.g., GPT-2 tokenizer) if not available locally and ensures prompts fit within typical model context limits without manual length estimation.
Implements automatic BPE-based prompt truncation with local caching of encoder resources, enabling context-aware evaluation without manual prompt length management or model-specific tokenizer configuration
More robust than character-count-based truncation (which doesn't account for tokenization) and more general than model-specific truncation (which requires per-model configuration)
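A sketch of the truncation idea, using tiktoken's "gpt2" encoding as a stand-in for the GPT-2 BPE encoder that crop.py downloads; the use of tiktoken and the choice of which end to trim are assumptions, not the repository's implementation.

```python
# Sketch of context-aware truncation. The 2048-token budget matches the
# description above; everything else (tiktoken, trimming from the front)
# is an assumption.
import tiktoken

def crop_prompt(prompt: str, max_tokens: int = 2048) -> str:
    enc = tiktoken.get_encoding("gpt2")
    tokens = enc.encode(prompt)
    if len(tokens) <= max_tokens:
        return prompt
    # Keep the final tokens so the test question at the end of the prompt
    # survives; earlier few-shot examples are dropped first.
    return enc.decode(tokens[-max_tokens:])
```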
model calibration measurement across confidence metrics
Medium confidence — Measures how well-calibrated model predictions are using multiple calibration metrics: Expected Calibration Error (ECE), Static Calibration Error (SCE), Root Mean Square Calibration Error (RMSCE), Adaptive Calibration Error (ACE), and Threshold Adaptive Calibration Error (TACE). The calib_tools.py system supports different binning schemes (uniform, adaptive) and normalization methods, enabling analysis of whether model confidence scores align with actual accuracy across prediction classes. This is critical for understanding model reliability beyond raw accuracy.
Implements five distinct calibration metrics (ECE, SCE, RMSCE, ACE, TACE) with configurable binning schemes and normalization methods, enabling comprehensive analysis of model confidence calibration beyond simple accuracy measurement
More comprehensive than single-metric calibration (e.g., ECE alone) and more flexible than fixed binning schemes, allowing researchers to identify calibration issues across different granularities and binning strategies
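A sketch of Expected Calibration Error with uniform binning, the simplest of the five metrics listed above; the other metrics differ mainly in binning and aggregation. The function name and signature are illustrative and not the calib_tools.py API.

```python
# ECE with uniform bins: weighted average of |accuracy - confidence| per bin.
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=15):
    """confidences: max predicted probability per example; correct: 0/1 per example."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            ece += mask.mean() * gap  # weight by fraction of examples in this bin
    return ece
```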
hierarchical subject organization and result aggregation
Medium confidence — Organizes 57 subjects into a hierarchical taxonomy (e.g., STEM → Physics → High School Physics) and aggregates evaluation results at multiple levels: per-subject accuracy, per-category accuracy (e.g., all STEM subjects), and overall benchmark accuracy. The system uses categories.py to define the hierarchy and evaluate_flan.py to compute aggregated metrics, enabling both fine-grained analysis (which specific subjects are weak) and high-level comparison (overall model capability). This hierarchical structure mirrors how knowledge is organized in educational systems.
Implements hierarchical subject organization (57 subjects grouped into 4 major categories: STEM, humanities, social sciences, other) with multi-level result aggregation, enabling both granular subject-level analysis and high-level category comparison in a single evaluation framework
More structured than flat subject lists and more informative than single overall scores, enabling researchers to identify domain-specific weaknesses and guide targeted model improvements
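A sketch of the roll-up step, assuming per-subject correct/total counts and a flat {category: [subjects]} mapping; the actual grouping in categories.py may differ, and the subjects shown are an illustrative subset.

```python
# Multi-level aggregation: per-subject counts rolled up to category and
# overall accuracy, weighting each subject by its number of questions.
CATEGORY_MAP = {                      # illustrative subset, not the full taxonomy
    "STEM": ["high_school_physics", "college_chemistry"],
    "humanities": ["philosophy", "world_religions"],
}

def aggregate(per_subject):
    """per_subject: {subject: (num_correct, num_questions)}."""
    report = {}
    total_correct = total_n = 0
    for category, subjects in CATEGORY_MAP.items():
        c = sum(per_subject[s][0] for s in subjects if s in per_subject)
        n = sum(per_subject[s][1] for s in subjects if s in per_subject)
        if n:
            report[category] = c / n
        total_correct += c
        total_n += n
    report["overall"] = total_correct / total_n if total_n else float("nan")
    return report
```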
standardized evaluation harness with reproducible model testing
Medium confidence — Provides a complete evaluation harness (evaluate_flan.py) that orchestrates the entire MMLU evaluation workflow: loading dataset, generating few-shot prompts, querying models, collecting predictions, computing accuracy, and aggregating results. The main() function coordinates these steps with configurable parameters (model selection, number of examples, output paths), ensuring reproducible evaluation across different models and runs. This harness abstracts away implementation details and provides a standard interface for model evaluation.
Provides a complete, self-contained evaluation harness that handles dataset loading, prompt generation, model querying, result collection, and aggregation in a single orchestrated workflow, eliminating the need for custom evaluation code
More complete than individual evaluation functions and more reproducible than manual evaluation scripts, enabling consistent benchmarking across teams and time periods
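A sketch of the orchestration loop, reusing the prompt, truncation, and aggregation sketches above; `query_model` is a placeholder for whatever model interface is under test, and the dev/test CSV layout is an assumption based on the common MMLU distribution format, not the harness's actual I/O.

```python
# End-to-end loop: load per-subject data, build 5-shot prompts, query the
# model, score letter predictions, and aggregate. File layout and the
# query_model callable are assumptions.
import os
import pandas as pd

def evaluate(data_dir, subjects, query_model, k=5, max_tokens=2048):
    per_subject = {}
    for subject in subjects:
        dev_df = pd.read_csv(os.path.join(data_dir, "dev", f"{subject}_dev.csv"), header=None)
        test_df = pd.read_csv(os.path.join(data_dir, "test", f"{subject}_test.csv"), header=None)
        correct = 0
        for i in range(len(test_df)):
            prompt = gen_prompt(dev_df, subject, k) + format_example(test_df, i, include_answer=False)
            prompt = crop_prompt(prompt, max_tokens)   # see the truncation sketch above
            pred = query_model(prompt).strip()[:1]     # expect a single letter "A"-"D"
            correct += int(pred == test_df.iloc[i, 5])
        per_subject[subject] = (correct, len(test_df))
    return aggregate(per_subject)                      # see the aggregation sketch above
```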
structured subject category taxonomy and hierarchical organization
Medium confidence — Defines and maintains a hierarchical taxonomy of 57 subjects organized into 4 high-level categories (STEM, humanities, social sciences, other). The categories.py module encodes this taxonomy as a structured data structure (likely a dictionary or class hierarchy) that maps subjects to categories, enabling consistent categorization across the evaluation pipeline. This taxonomy is used throughout the evaluation process for subject-level result aggregation, category-level analysis, and leaderboard organization.
Encodes a structured taxonomy of 57 subjects into 4 categories as a centralized, reusable data structure (categories.py), enabling consistent categorization across all evaluation and analysis code. This separation of taxonomy definition from evaluation logic allows researchers to analyze results at multiple levels of granularity without duplicating category mappings.
Provides a centralized, version-controlled taxonomy compared to ad-hoc category definitions scattered across analysis scripts, ensuring consistency and enabling reproducible category-level analysis across publications.
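A sketch of what such a centralized two-level taxonomy module might look like: subjects map to subcategories, and top-level categories list the subcategories they cover. The entries and lookup helper are illustrative only, not the contents of categories.py.

```python
# Illustrative subset of a two-level taxonomy; not the full 57-subject mapping.
SUBCATEGORIES = {
    "high_school_physics": "physics",
    "college_physics": "physics",
    "philosophy": "philosophy",
    "professional_law": "law",
}

CATEGORIES = {
    "STEM": ["physics", "chemistry", "math"],
    "humanities": ["philosophy", "history", "law"],
    "social sciences": ["economics", "psychology"],
    "other": ["business", "health", "misc"],
}

def category_of(subject: str) -> str:
    """Resolve a subject to its top-level category via its subcategory."""
    sub = SUBCATEGORIES[subject]
    for category, subs in CATEGORIES.items():
        if sub in subs:
            return category
    raise KeyError(subject)
```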
Capabilities are decomposed by AI analysis. Each maps to specific user intents and improves with match feedback.
Related Artifacts — sharing capabilities
Artifacts that share capabilities with MMLU, ranked by overlap. Discovered automatically through the match graph.
Qwen3-8B
text-generation model. 10,018,533 downloads.
MiniMax: MiniMax M2.1
MiniMax-M2.1 is a lightweight, state-of-the-art large language model optimized for coding, agentic workflows, and modern application development. With only 10 billion activated parameters, it delivers a major jump in real-world...
OpenAI: GPT-5.2
GPT-5.2 is the latest frontier-grade model in the GPT-5 series, offering stronger agentic and long-context performance compared to GPT-5.1. It uses adaptive reasoning to allocate computation dynamically, responding quickly...
Qwen2.5-0.5B-Instruct
text-generation model. 6,145,130 downloads.
GPT-NeoX-20B: An Open-Source Autoregressive Language Model (GPT-NeoX)
Qwen2.5-3B-Instruct
text-generation model. 9,207,977 downloads.
Best For
- ✓ LLM researchers evaluating foundation models and fine-tuned variants
- ✓ ML engineers comparing model performance before/after training or instruction-tuning
- ✓ Teams building general-purpose AI systems that need broad knowledge coverage validation
- ✓ Researchers studying few-shot learning behavior across knowledge domains
- ✓ Teams implementing MMLU evaluation in custom evaluation pipelines
- ✓ Developers extending MMLU with custom subjects or prompt templates
- ✓ Evaluating models with limited context windows (e.g., older models, edge deployments)
- ✓ Automated evaluation pipelines that need to handle variable-length prompts robustly
Known Limitations
- ⚠ Multiple-choice format doesn't capture reasoning depth or explainability — models can guess correctly without understanding
- ⚠ 57 subjects provide breadth but limited depth per subject (typically 100-300 questions per subject)
- ⚠ Few-shot evaluation (5 examples) may not reflect zero-shot or many-shot performance patterns
- ⚠ No evaluation of reasoning steps or intermediate work — only final answer correctness
- ⚠ Fixed 5-example few-shot format — no support for zero-shot or variable-shot evaluation without code modification
- ⚠ Example selection is deterministic (first 5 training examples) — no randomization or stratified sampling
UnfragileRank
UnfragileRank is computed from adoption signals, documentation quality, ecosystem connectivity, match graph feedback, and freshness. No artifact can pay for a higher rank.
About
Massive Multitask Language Understanding. 15,908 questions across 57 subjects (STEM, humanities, social sciences, professional). Tests broad knowledge and problem-solving. The most widely reported general LLM benchmark.