{"passport":{"unfragile":{"@version":"1.0","version":"2026-05","artifact":{"id":"mmlu","slug":"mmlu","name":"MMLU","type":"benchmark","url":"https://github.com/hendrycks/test","page_url":"https://unfragile.ai/mmlu","categories":["testing-quality"],"tags":[],"pricing":{"model":"free","free":true,"starting_price":null},"status":"active","verified":false},"capabilities":[{"id":"mmlu__cap_0","uri":"capability://data.processing.analysis.few.shot.multitask.evaluation.across.57.knowledge.domains","name":"few-shot multitask evaluation across 57 knowledge domains","description":"Executes standardized few-shot prompting evaluation on language models across 57 subjects (STEM, humanities, social sciences, professional) by constructing few-shot prompts with 5 example question-answer pairs per subject, then measuring accuracy on held-out test sets. The system uses a hierarchical subject organization (e.g., STEM → physics → high school physics) and aggregates results at subject, category, and overall levels to produce granular performance metrics.","intents":["I need to benchmark my LLM against a standard that covers breadth of knowledge across multiple domains","I want to understand where my model performs well and poorly across different knowledge categories","I need reproducible evaluation results that are comparable to published leaderboards and research papers"],"best_for":["LLM researchers evaluating foundation models and fine-tuned variants","ML engineers comparing model performance before/after training or instruction-tuning","Teams building general-purpose AI systems that need broad knowledge coverage validation"],"limitations":["Multiple-choice format doesn't capture reasoning depth or explain-ability — models can guess correctly without understanding","57 subjects provide breadth but limited depth per subject (typically 50-100 questions per subject)","Few-shot evaluation (5 examples) may not reflect zero-shot or many-shot performance patterns","No evaluation of reasoning steps or intermediate work — only final answer correctness"],"requires":["Language model with API or local inference capability","MMLU dataset (15,908 questions across 57 subjects) — available in hendrycks/test repository","Python 3.6+ for running evaluation scripts","Model must support prompt-based inference (text input → text output)"],"input_types":["text prompts (few-shot examples + test question)","multiple-choice questions with 4 options (A, B, C, D)"],"output_types":["single-character answer (A/B/C/D)","accuracy scores per subject (0-100%)","aggregated accuracy across categories and overall benchmark"],"categories":["data-processing-analysis","benchmark-evaluation"],"confidence":0.5,"matches":0,"success_rate":0},{"id":"mmlu__cap_1","uri":"capability://text.generation.language.prompt.generation.with.few.shot.example.formatting","name":"prompt generation with few-shot example formatting","description":"Constructs few-shot prompts by formatting subject name, selecting 5 in-context examples from the training set, and appending the test question with multiple-choice options. The system implements format_subject() to normalize subject names, format_example() to structure each example as 'Question: ... Options: A) ... B) ... C) ... D) ... Answer: X', and gen_prompt() to concatenate examples with the target question. This approach ensures consistent prompt structure across all 57 subjects and enables reproducible few-shot evaluation.","intents":["I need to generate consistent few-shot prompts for evaluating models on a specific subject","I want to ensure prompt formatting doesn't introduce bias or inconsistency across different subjects","I need to understand what examples are being used to evaluate my model's few-shot learning capability"],"best_for":["Researchers studying few-shot learning behavior across knowledge domains","Teams implementing MMLU evaluation in custom evaluation pipelines","Developers extending MMLU with custom subjects or prompt templates"],"limitations":["Fixed 5-example few-shot format — no support for zero-shot or variable-shot evaluation without code modification","Example selection is deterministic (first 5 training examples) — no randomization or stratified sampling","Prompt format is rigid (Question → Options → Answer) — no support for alternative prompt templates or chain-of-thought formatting","No prompt optimization or in-context learning strategies (e.g., example reordering, semantic similarity-based selection)"],"requires":["MMLU dataset with train/test split per subject","Subject name mapping (e.g., 'abstract_algebra' → 'Abstract Algebra')","Python 3.6+ with string formatting capabilities"],"input_types":["subject name (string, e.g., 'physics')","training examples (list of dicts with 'question', 'choices', 'answer' keys)","test question (dict with same structure)"],"output_types":["formatted prompt string (text)","prompt with embedded examples and target question"],"categories":["text-generation-language","prompt-engineering"],"confidence":0.5,"matches":0,"success_rate":0},{"id":"mmlu__cap_2","uri":"capability://data.processing.analysis.context.aware.prompt.truncation.via.bpe.tokenization","name":"context-aware prompt truncation via bpe tokenization","description":"Truncates prompts to fit within model context windows using Byte Pair Encoding (BPE) tokenization. The crop.py system encodes prompts to BPE tokens, truncates to a maximum of 2048 tokens, and decodes back to text while preserving semantic coherence. This approach automatically downloads encoder resources (e.g., GPT-2 tokenizer) if not available locally and ensures prompts fit within typical model context limits without manual length estimation.","intents":["I need to ensure prompts fit within my model's context window without manual length calculation","I want to automatically handle long prompts by truncating them intelligently rather than failing","I need consistent tokenization across different models that may use different tokenizers"],"best_for":["Evaluating models with limited context windows (e.g., older models, edge deployments)","Automated evaluation pipelines that need to handle variable-length prompts robustly","Researchers studying the impact of context length on few-shot learning performance"],"limitations":["Fixed 2048-token limit — no support for dynamic limits based on model capabilities","BPE tokenization may not align with model's actual tokenizer (e.g., models using SentencePiece or Tiktoken)","Truncation is lossy — removes examples or question context, potentially degrading evaluation validity","No intelligent truncation strategy (e.g., preserve examples, truncate question) — simple token-count-based truncation"],"requires":["Python 3.6+ with tiktoken or equivalent BPE encoder library","Internet access to download encoder resources on first run (cached locally thereafter)","Text input (prompt string)"],"input_types":["prompt text (string of arbitrary length)"],"output_types":["truncated prompt text (string, max 2048 BPE tokens)"],"categories":["data-processing-analysis","text-generation-language"],"confidence":0.5,"matches":0,"success_rate":0},{"id":"mmlu__cap_3","uri":"capability://data.processing.analysis.model.calibration.measurement.across.confidence.metrics","name":"model calibration measurement across confidence metrics","description":"Measures how well-calibrated model predictions are using multiple calibration metrics: Expected Calibration Error (ECE), Static Calibration Error (SCE), Root Mean Square Calibration Error (RMSCE), Adaptive Calibration Error (ACE), and Threshold Adaptive Calibration Error (TACE). The calib_tools.py system supports different binning schemes (uniform, adaptive) and normalization methods, enabling analysis of whether model confidence scores align with actual accuracy across prediction classes. This is critical for understanding model reliability beyond raw accuracy.","intents":["I need to understand if my model's confidence scores are reliable indicators of correctness","I want to measure whether my model is overconfident or underconfident across different prediction classes","I need to compare calibration quality across different models or training approaches"],"best_for":["ML engineers building production systems where confidence scores drive downstream decisions (e.g., routing to human review)","Researchers studying model reliability and uncertainty quantification","Teams evaluating whether fine-tuning or instruction-tuning improves model calibration"],"limitations":["Requires model to output confidence scores or probabilities — doesn't work with models that only output discrete answers","Calibration metrics are aggregate statistics — don't identify which specific classes or domains are miscalibrated","Different metrics (ECE, SCE, ACE) can rank models differently — no consensus on which metric is most important","Binning scheme choice significantly impacts results — uniform vs adaptive binning can yield different conclusions"],"requires":["Model predictions with confidence scores (probabilities for each class)","Ground truth labels for evaluation set","Python 3.6+ with numpy for metric computation"],"input_types":["predictions (list of predicted class labels)","confidence scores (list of floats 0-1 for each prediction)","ground truth labels (list of true class labels)"],"output_types":["calibration metrics (dict with ECE, SCE, RMSCE, ACE, TACE values)","calibration plots (optional visualization of confidence vs accuracy)"],"categories":["data-processing-analysis","safety-moderation"],"confidence":0.5,"matches":0,"success_rate":0},{"id":"mmlu__cap_4","uri":"capability://data.processing.analysis.hierarchical.subject.organization.and.result.aggregation","name":"hierarchical subject organization and result aggregation","description":"Organizes 57 subjects into a hierarchical taxonomy (e.g., STEM → Physics → High School Physics) and aggregates evaluation results at multiple levels: per-subject accuracy, per-category accuracy (e.g., all STEM subjects), and overall benchmark accuracy. The system uses categories.py to define the hierarchy and evaluate_flan.py to compute aggregated metrics, enabling both fine-grained analysis (which specific subjects are weak) and high-level comparison (overall model capability). This hierarchical structure mirrors how knowledge is organized in educational systems.","intents":["I need to understand my model's performance across different knowledge domains, not just an overall score","I want to identify which subject areas my model struggles with to guide further training or fine-tuning","I need to compare models at different levels of granularity (overall vs category vs subject)"],"best_for":["Researchers analyzing model strengths and weaknesses across knowledge domains","Teams building domain-specific models and wanting to validate breadth of knowledge","Educators or curriculum designers studying how LLMs understand different subjects"],"limitations":["Subject categorization is fixed — no support for custom hierarchies or alternative taxonomies","Aggregation is simple averaging — doesn't weight subjects by importance or difficulty","57 subjects provide broad coverage but uneven depth (some subjects have 50 questions, others 100+)","No per-question analysis — only subject-level and above aggregation"],"requires":["MMLU dataset with subject labels and category mappings","Python 3.6+ with dict/list operations for aggregation","Per-subject accuracy scores from evaluation"],"input_types":["per-subject accuracy scores (dict mapping subject name to accuracy %)","subject-to-category mapping (dict defining hierarchy)"],"output_types":["per-subject accuracy (dict)","per-category accuracy (dict)","overall benchmark accuracy (float 0-100)"],"categories":["data-processing-analysis","planning-reasoning"],"confidence":0.5,"matches":0,"success_rate":0},{"id":"mmlu__cap_5","uri":"capability://automation.workflow.standardized.evaluation.harness.with.reproducible.model.testing","name":"standardized evaluation harness with reproducible model testing","description":"Provides a complete evaluation harness (evaluate_flan.py) that orchestrates the entire MMLU evaluation workflow: loading dataset, generating few-shot prompts, querying models, collecting predictions, computing accuracy, and aggregating results. The main() function coordinates these steps with configurable parameters (model selection, number of examples, output paths), ensuring reproducible evaluation across different models and runs. This harness abstracts away implementation details and provides a standard interface for model evaluation.","intents":["I want to evaluate my model on MMLU with minimal custom code","I need reproducible evaluation results that match published benchmarks","I want to compare multiple models using the same evaluation protocol"],"best_for":["ML engineers and researchers evaluating models without deep benchmark expertise","Teams integrating MMLU evaluation into CI/CD pipelines or automated testing","Researchers publishing results that need to be reproducible and comparable to prior work"],"limitations":["Harness is specific to FLAN models — extending to other model families requires code modification","No built-in support for batch evaluation or distributed evaluation across multiple GPUs/TPUs","Results are written to CSV files — no structured output format (JSON, database) for programmatic access","No caching of model outputs — re-running evaluation requires re-querying the model"],"requires":["MMLU dataset (hendrycks/test repository)","Model implementation (FLAN or compatible interface)","Python 3.6+ with file I/O and CSV writing capabilities","Sufficient compute to run inference on 15,908 questions"],"input_types":["model identifier or path","dataset path","configuration parameters (num_examples, output_path, etc.)"],"output_types":["CSV files with per-subject and aggregated results","accuracy metrics (per-subject, per-category, overall)"],"categories":["automation-workflow","data-processing-analysis"],"confidence":0.5,"matches":0,"success_rate":0},{"id":"mmlu__cap_6","uri":"capability://data.processing.analysis.structured.subject.category.taxonomy.and.hierarchical.organization","name":"structured subject category taxonomy and hierarchical organization","description":"Defines and maintains a hierarchical taxonomy of 57 subjects organized into 4 high-level categories (STEM, humanities, social sciences, professional). The categories.py module encodes this taxonomy as a structured data structure (likely a dictionary or class hierarchy) that maps subjects to categories, enabling consistent categorization across the evaluation pipeline. This taxonomy is used throughout the evaluation process for subject-level result aggregation, category-level analysis, and leaderboard organization.","intents":["Organize 57 subjects into meaningful high-level categories for performance analysis and reporting","Enable category-level performance comparison (e.g., STEM vs humanities) to identify knowledge distribution patterns","Maintain consistent subject-to-category mapping across all evaluation runs and publications","Support custom analysis and filtering by category or subject"],"best_for":["Researchers analyzing model performance patterns across knowledge domains","Teams publishing MMLU results with category-level breakdowns","Developers building analysis tools that need subject-to-category mappings"],"limitations":["Taxonomy is fixed and immutable — cannot add new subjects or reorganize categories without modifying source code","Category definitions are coarse-grained (4 categories) — may obscure fine-grained performance patterns within categories","No weighting or importance ranking — all subjects treated equally in aggregation","Taxonomy reflects English-language academic knowledge structure — may not align with other knowledge organization systems"],"requires":["categories.py file with subject-to-category mappings","Python 3.7+ for importing and using taxonomy"],"input_types":["Subject name (string, e.g., 'abstract_algebra')"],"output_types":["Category name (string, e.g., 'STEM')","List of subjects in a category","Complete taxonomy as structured data"],"categories":["data-processing-analysis","memory-knowledge"],"confidence":0.5,"matches":0,"success_rate":0},{"id":"mmlu__headline","uri":"capability://testing.quality.comprehensive.benchmark.for.evaluating.language.model.understanding.across.multiple.subjects","name":"comprehensive benchmark for evaluating language model understanding across multiple subjects","description":"The Massive Multitask Language Understanding (MMLU) benchmark is a widely recognized tool for assessing the performance of language models across a diverse range of subjects including STEM, humanities, and social sciences, making it essential for evaluating general language understanding capabilities.","intents":["best language model benchmark","benchmark for evaluating language understanding","MMLU for language models","how to test language models","top benchmarks for LLM evaluation"],"best_for":["researchers","developers","AI practitioners"],"limitations":["requires a compatible language model"],"requires":["language model","evaluation setup"],"input_types":["text prompts"],"output_types":["evaluation scores","performance metrics"],"categories":["testing-quality"],"confidence":0.5,"matches":0,"success_rate":0}],"trust":{"score":61,"verified":false,"data_access_risk":"high","permissions":["Language model with API or local inference capability","MMLU dataset (15,908 questions across 57 subjects) — available in hendrycks/test repository","Python 3.6+ for running evaluation scripts","Model must support prompt-based inference (text input → text output)","MMLU dataset with train/test split per subject","Subject name mapping (e.g., 'abstract_algebra' → 'Abstract Algebra')","Python 3.6+ with string formatting capabilities","Python 3.6+ with tiktoken or equivalent BPE encoder library","Internet access to download encoder resources on first run (cached locally thereafter)","Text input (prompt string)"],"failure_modes":["Multiple-choice format doesn't capture reasoning depth or explain-ability — models can guess correctly without understanding","57 subjects provide breadth but limited depth per subject (typically 50-100 questions per subject)","Few-shot evaluation (5 examples) may not reflect zero-shot or many-shot performance patterns","No evaluation of reasoning steps or intermediate work — only final answer correctness","Fixed 5-example few-shot format — no support for zero-shot or variable-shot evaluation without code modification","Example selection is deterministic (first 5 training examples) — no randomization or stratified sampling","Prompt format is rigid (Question → Options → Answer) — no support for alternative prompt templates or chain-of-thought formatting","No prompt optimization or in-context learning strategies (e.g., example reordering, semantic similarity-based selection)","Fixed 2048-token limit — no support for dynamic limits based on model capabilities","BPE tokenization may not align with model's actual tokenizer (e.g., models using SentencePiece or Tiktoken)","builder identity is not verified yet","no observed match outcomes yet"],"rank_breakdown":{"adoption":0.7,"quality":0.8500000000000001,"ecosystem":0.39999999999999997,"match_graph":0.25,"freshness":0.52,"weights":{"adoption":0.25,"quality":0.35,"ecosystem":0.15,"match_graph":0.2,"freshness":0.05}},"observed_outcomes":{"matches":0,"success_rate":0,"avg_confidence":0,"top_intents":[],"last_matched_at":null},"maintenance":{"status":"active","updated_at":"2026-06-17T09:51:04.693Z","last_scraped_at":null,"last_commit":null},"community":{"stars":null,"forks":null,"weekly_downloads":null,"model_downloads":null,"model_likes":null}},"distribution":{"claim_url":"https://unfragile.ai/submit?claim=mmlu","compare_url":"https://unfragile.ai/compare?artifact=mmlu"}},"signature":"Ud5cGV/3qoRyAcX4FJI3t9gbBMNU0jFyuuBid/rguzfJHxer7bMJLI++aLwoI6AIgGBPDvszrlv2NtQQNjI5BA==","signedAt":"2026-06-21T13:07:20.764Z","signedBy":"unfragile.ai","version":1},"_links":{"self":"https://unfragile.ai/api/v1/passport/mmlu","artifact":"https://unfragile.ai/mmlu","verify":"https://unfragile.ai/api/v1/verify?slug=mmlu","publicKey":"https://unfragile.ai/api/v1/trust-passport-public-key","spec":"https://unfragile.ai/trust","schema":"https://unfragile.ai/schema.json","docs":"https://unfragile.ai/docs"}}