{"passport":{"unfragile":{"@version":"1.0","version":"2026-05","artifact":{"id":"mmmu","slug":"mmmu","name":"MMMU","type":"benchmark","url":"https://mmmu-benchmark.github.io","page_url":"https://unfragile.ai/mmmu","categories":["testing-quality"],"tags":[],"pricing":{"model":"free","free":true,"starting_price":null},"status":"active","verified":false},"capabilities":[{"id":"mmmu__cap_0","uri":"capability://data.processing.analysis.expert.level.multimodal.reasoning.evaluation.across.30.college.subjects","name":"expert-level multimodal reasoning evaluation across 30 college subjects","description":"Evaluates AI models on 11,500 expert-level questions spanning 6 disciplines (Art & Design, Business, Science, Health & Medicine, Humanities & Social Science, Tech & Engineering) and 183 subfields, requiring simultaneous perception of heterogeneous visual modalities (charts, diagrams, chemical structures, music sheets, maps, tables) and application of college-level domain knowledge with deliberate multi-step reasoning. Questions are sourced from actual college exams, textbooks, and lectures to ensure authentic difficulty and real-world relevance.","intents":["Measure whether my multimodal AI model can handle expert-level domain reasoning across diverse academic subjects","Benchmark my model's performance against GPT-4V and other state-of-the-art multimodal systems on a standardized, comprehensive evaluation","Identify which academic disciplines and visual modalities my model struggles with to guide architecture improvements","Validate that my model generalizes beyond common benchmarks to college-level problems requiring integrated perception and knowledge"],"best_for":["AI research teams developing multimodal large language models (LLMs) and vision-language models (VLMs)","Organizations evaluating commercial multimodal AI systems (GPT-4V, Claude, Gemini) for domain-specific applications","Academic institutions benchmarking student-built AI systems against established baselines","Enterprise teams assessing whether multimodal AI is ready for professional knowledge work (medicine, engineering, business analysis)"],"limitations":["Benchmark is static and college-level scoped — does not measure real-time interactive reasoning, multi-turn dialogue, or adversarial robustness","Scoring methodology not explicitly documented in public materials — exact match vs. partial credit scoring formula unknown","Train/dev/test split ratios and data contamination analysis not publicly disclosed, creating uncertainty about overlap with LLM training corpora","No published analysis of demographic biases in question selection or subject representation balance across the 30 subjects","Evaluation requires either remote submission to EvalAI server or local execution environment — no lightweight API-based evaluation option documented","Performance ceiling not yet saturated (GPT-4V at 56% accuracy) but no analysis of whether benchmark has inherent ceiling effects at higher capability levels"],"requires":["Access to Hugging Face datasets (MMMU or MMMU-Pro versions)","Multimodal AI model capable of processing images and text simultaneously","Python environment for local evaluation (specific version requirements unknown)","Either EvalAI account for remote submission or local compute resources for batch evaluation","Image processing capability supporting 30+ heterogeneous visual modality types (charts, diagrams, chemical structures, music sheets, maps, tables)"],"input_types":["image (30 heterogeneous types: charts, diagrams, maps, tables, music sheets, chemical structures, photographs, illustrations, etc.)","text (college-level questions in multiple-choice or free-form format — exact format not publicly documented)","structured metadata (subject classification, discipline, difficulty level, image type)"],"output_types":["accuracy score (percentage correct across all 11,500 questions)","per-discipline breakdown (accuracy across 6 core disciplines)","per-subject breakdown (accuracy across 30 college subjects)","per-modality breakdown (accuracy by image type: charts vs. diagrams vs. chemical structures, etc.)","leaderboard ranking (comparative performance vs. GPT-4V baseline and other evaluated models)"],"categories":["data-processing-analysis","benchmark-evaluation"],"confidence":0.5,"matches":0,"success_rate":0},{"id":"mmmu__cap_1","uri":"capability://data.processing.analysis.discipline.specific.performance.stratification.and.diagnostic.breakdown","name":"discipline-specific performance stratification and diagnostic breakdown","description":"Provides granular performance metrics stratified across 6 core academic disciplines (Art & Design, Business, Science, Health & Medicine, Humanities & Social Science, Tech & Engineering) and 183 subfields, enabling identification of which knowledge domains and subject areas a model excels or struggles with. Leaderboard and evaluation infrastructure expose per-discipline accuracy, per-subject accuracy, and per-visual-modality accuracy to support targeted model improvement and domain-specific capability assessment.","intents":["Identify which academic disciplines my model is weakest in to prioritize training data or architecture changes","Determine whether my model has balanced capability across STEM vs. humanities vs. professional domains or shows systematic gaps","Assess whether my model's performance on visual reasoning (charts, diagrams) differs from performance on domain knowledge (biology, history, business)","Compare my model's discipline-specific strengths against competitors to find competitive advantages in specific domains"],"best_for":["Model developers optimizing multimodal systems for specific professional domains (e.g., medical AI, engineering design tools)","Research teams analyzing failure modes and knowledge gaps in vision-language models","Enterprise teams selecting multimodal AI for domain-specific applications (e.g., medical diagnosis support, legal document analysis)"],"limitations":["Granular per-subject breakdown available on leaderboard but specific scores for all 183 subfields not documented in public materials","No analysis of inter-discipline correlations — unknown whether models that excel in Science also excel in Tech & Engineering","Visual modality breakdown (charts vs. diagrams vs. chemical structures) mentioned but specific per-modality accuracy scores not published in available documentation","No error analysis or failure case categorization — cannot determine whether failures are due to visual perception, domain knowledge gaps, or reasoning errors"],"requires":["Access to official MMMU leaderboard with per-discipline and per-subject score breakdowns","Model evaluation on full 11,500-question dataset to generate statistically meaningful per-discipline scores","Discipline and subject metadata annotations for each question in the dataset"],"input_types":["model predictions (per-question outputs from evaluated multimodal model)","ground truth labels (correct answers with discipline/subject/modality classifications)"],"output_types":["per-discipline accuracy (6 values: Art & Design, Business, Science, Health & Medicine, Humanities & Social Science, Tech & Engineering)","per-subject accuracy (30 values across college subjects)","per-modality accuracy (breakdown by chart, diagram, chemical structure, music sheet, map, table, etc.)","comparative leaderboard rankings by discipline"],"categories":["data-processing-analysis","benchmark-evaluation"],"confidence":0.5,"matches":0,"success_rate":0},{"id":"mmmu__cap_2","uri":"capability://image.visual.heterogeneous.visual.modality.evaluation.with.domain.specific.visual.types","name":"heterogeneous visual modality evaluation with domain-specific visual types","description":"Evaluates multimodal model performance across 30 distinct visual modality types including domain-specific visuals (chemical structures, music sheets, mathematical diagrams) alongside common types (charts, tables, maps, photographs, illustrations). The benchmark explicitly tests whether models can perceive and reason over specialized visual representations used in professional and academic contexts, not just natural images or generic diagrams.","intents":["Test whether my multimodal model can handle domain-specific visual representations (chemical structures, music notation, engineering diagrams) required for professional applications","Measure perception accuracy across diverse visual modality types to identify which visual formats my model struggles with","Validate that my model generalizes beyond natural images and simple charts to complex, domain-specific visual content","Benchmark visual perception capability independently from domain knowledge to isolate perception bottlenecks"],"best_for":["AI teams building domain-specific multimodal systems (chemistry analysis, music transcription, engineering design review)","Organizations evaluating whether commercial multimodal models can handle specialized visual content in their industry","Research teams studying visual perception in vision-language models across diverse modality types"],"limitations":["Specific list of all 30 visual modality types not fully enumerated in public documentation — only partial list (charts, diagrams, maps, tables, music sheets, chemical structures) provided","Per-modality accuracy scores not published in available materials — cannot directly compare model performance on charts vs. chemical structures","No analysis of visual complexity or OCR difficulty per modality type — unknown whether performance gaps are due to perception or reasoning","Domain-specific visual types (chemical structures, music sheets) may require specialized preprocessing or tokenization not documented in public materials"],"requires":["Multimodal model with vision encoder capable of processing diverse visual formats (raster images, vector diagrams, structured notation)","Support for domain-specific visual parsing (e.g., chemical structure interpretation, music notation recognition)","Access to MMMU dataset with visual modality type annotations for each question"],"input_types":["image (30 types: charts, diagrams, maps, tables, music sheets, chemical structures, photographs, illustrations, and 22 additional types not enumerated)"],"output_types":["per-modality accuracy (breakdown of model performance by visual type)","modality-specific error analysis (which visual types cause failures)","visual perception capability score (independent of domain knowledge)"],"categories":["image-visual","benchmark-evaluation"],"confidence":0.5,"matches":0,"success_rate":0},{"id":"mmmu__cap_3","uri":"capability://automation.workflow.remote.and.local.evaluation.infrastructure.with.dual.submission.pathways","name":"remote and local evaluation infrastructure with dual submission pathways","description":"Provides two evaluation pathways: (1) remote submission via EvalAI server (established 2023-12-04) with test set answers released for local verification (2026-02-12), and (2) local evaluation capability enabling offline batch evaluation of models on the full 11,500-question dataset. The dual infrastructure supports both cloud-based leaderboard submission and self-hosted evaluation for organizations with data privacy or latency constraints.","intents":["Submit my model to the official MMMU leaderboard via EvalAI for comparative ranking against GPT-4V and other baselines","Evaluate my model locally on the full test set without uploading predictions to a remote server for privacy or compliance reasons","Iterate rapidly on model improvements by running local evaluation without waiting for remote leaderboard processing","Validate my model's performance before submitting to the official leaderboard to avoid failed submissions"],"best_for":["Research teams publishing results on official leaderboard for peer review and reproducibility","Enterprise organizations with data privacy requirements preventing cloud-based evaluation","Model developers iterating rapidly and requiring fast feedback loops (local evaluation)","Teams with proprietary models unable to share predictions with third-party evaluation services"],"limitations":["EvalAI submission format and API specification not documented in public materials — submission process requires reverse-engineering or accessing code repository","Local evaluation requires downloading full 11,500-question dataset and setting up evaluation environment — no lightweight API-based evaluation option","Test set answers released on 2026-02-12 (future date in documentation) — unclear whether answers are currently available or evaluation is still in progress","No documented SLA or evaluation turnaround time for EvalAI submissions — leaderboard update frequency unknown","Evaluation infrastructure dependencies (Python version, required libraries, compute requirements) not specified in public documentation"],"requires":["For remote evaluation: EvalAI account and submission credentials","For local evaluation: Python 3.9+ (assumed, not explicitly stated), Hugging Face datasets library, local compute resources for batch inference","Model inference capability supporting batch processing of 11,500 questions","Submission format specification (JSON, CSV, or other — not documented)"],"input_types":["model predictions (per-question outputs in unspecified format)","question IDs (to map predictions to ground truth)"],"output_types":["accuracy score (overall and per-discipline)","leaderboard ranking (for remote submissions)","evaluation report (for local submissions)"],"categories":["automation-workflow","benchmark-evaluation"],"confidence":0.5,"matches":0,"success_rate":0},{"id":"mmmu__cap_4","uri":"capability://planning.reasoning.multimodal.perception.and.knowledge.integration.assessment","name":"multimodal perception and knowledge integration assessment","description":"Explicitly evaluates three integrated capabilities: (1) perception (understanding diverse visual modalities), (2) knowledge (domain-specific subject expertise), and (3) reasoning (deliberate multi-step reasoning over multimodal inputs). Questions are designed to require simultaneous visual understanding and domain knowledge application, preventing models from succeeding through either perception alone or knowledge lookup alone. This integration testing approach validates end-to-end multimodal reasoning rather than isolated sub-capabilities.","intents":["Measure whether my model can integrate visual perception with domain knowledge to solve expert-level problems, not just recognize images or recall facts","Identify whether my model's failures are due to visual perception gaps, knowledge gaps, or reasoning deficiencies","Validate that my model performs genuine reasoning over multimodal inputs rather than pattern-matching or memorization","Assess whether my model's multimodal integration is robust across diverse visual types and knowledge domains"],"best_for":["AI research teams developing integrated multimodal reasoning systems","Organizations evaluating whether multimodal AI can handle real-world expert tasks requiring simultaneous perception and knowledge","Teams building domain-specific AI assistants (medical diagnosis, engineering design review) where perception and knowledge must be integrated"],"limitations":["No published error analysis distinguishing perception failures from knowledge failures from reasoning failures — cannot isolate which capability is the bottleneck","Reasoning complexity not quantified — unknown whether questions require 2-step or 10-step reasoning chains","No ablation studies showing performance impact of removing visual information or domain knowledge context","Scoring methodology not documented — unclear whether partial credit is given for correct reasoning with incorrect final answer"],"requires":["Multimodal model with integrated vision and language understanding (not separate vision and language modules)","Reasoning capability supporting multi-step inference over visual and textual inputs","Domain knowledge across 30 college subjects and 183 subfields"],"input_types":["image (visual modality required for question context)","text (question text requiring interpretation and reasoning)"],"output_types":["accuracy score (overall integration capability)","per-capability breakdown (if error analysis available: perception vs. knowledge vs. reasoning)","reasoning trace (if model supports chain-of-thought output)"],"categories":["planning-reasoning","benchmark-evaluation"],"confidence":0.5,"matches":0,"success_rate":0},{"id":"mmmu__cap_5","uri":"capability://data.processing.analysis.mmmu.pro.robust.variant.with.enhanced.evaluation.rigor","name":"mmmu-pro robust variant with enhanced evaluation rigor","description":"MMMU-Pro (introduced 2024-09-05) is a refined version of the base MMMU benchmark designed for more robust multimodal AI evaluation. The distinction from base MMMU is not fully documented in public materials, but the designation as 'robust' suggests improvements in question quality, answer verification, or evaluation methodology to reduce noise and improve benchmark reliability.","intents":["Evaluate my model on a more rigorous version of MMMU with improved question quality or answer verification","Benchmark against MMMU-Pro leaderboard to compare against models evaluated on the robust variant","Understand differences between base MMMU and MMMU-Pro to determine which version is appropriate for my evaluation needs"],"best_for":["Research teams publishing results and requiring maximum benchmark rigor and reproducibility","Organizations comparing models across both MMMU and MMMU-Pro to understand robustness improvements"],"limitations":["Distinction between MMMU and MMMU-Pro not documented in public materials — specific improvements unknown","Scale of MMMU-Pro dataset unknown (11,500 questions like base MMMU or smaller curated subset?)","No published comparison of base MMMU vs. MMMU-Pro performance to quantify robustness improvements","Unclear whether MMMU-Pro is a superset, subset, or alternative version of base MMMU"],"requires":["Access to MMMU-Pro dataset on Hugging Face","Understanding of differences between MMMU and MMMU-Pro (not publicly documented)"],"input_types":["image (multimodal questions)","text (questions and answers)"],"output_types":["accuracy score on MMMU-Pro variant","leaderboard ranking on MMMU-Pro"],"categories":["data-processing-analysis","benchmark-evaluation"],"confidence":0.5,"matches":0,"success_rate":0},{"id":"mmmu__cap_6","uri":"capability://data.processing.analysis.human.expert.baseline.and.comparative.performance.analysis","name":"human expert baseline and comparative performance analysis","description":"Provides human expert performance baseline on the full 11,500-question dataset, enabling assessment of whether AI models are approaching or exceeding human-level performance on expert-level multimodal reasoning tasks. The leaderboard (updated 2024-01-31) includes human expert scores, allowing direct comparison of AI model performance against domain expert accuracy.","intents":["Determine whether my model's performance is approaching human expert level or significantly below","Understand the performance gap between my model and human experts to assess readiness for real-world deployment","Benchmark my model against human baselines rather than just other AI models to contextualize performance","Identify domains where my model exceeds human performance vs. where significant gaps remain"],"best_for":["Organizations evaluating whether multimodal AI is ready to augment or replace human experts in specific domains","Research teams analyzing the relationship between AI and human performance on expert-level tasks","Teams building AI systems for professional knowledge work (medicine, law, engineering) where human baseline is critical"],"limitations":["Specific human expert accuracy scores not provided in available documentation — only reference to leaderboard availability","Human expert methodology not documented — unclear whether experts were domain specialists, generalists, or mixed","No analysis of inter-expert agreement or expert confidence — unknown whether human baseline is reliable","Per-discipline human performance not published — cannot determine which domains have larger AI-human gaps","No analysis of whether human experts had access to the same visual information as AI models"],"requires":["Access to official MMMU leaderboard with human expert baseline scores","Model evaluation on full 11,500-question dataset for meaningful comparison"],"input_types":["model predictions (per-question outputs)","human expert predictions (ground truth or reference)"],"output_types":["human expert accuracy baseline (overall and per-discipline)","AI vs. human performance gap (absolute and percentage)","comparative leaderboard ranking (AI models vs. human experts)"],"categories":["data-processing-analysis","benchmark-evaluation"],"confidence":0.5,"matches":0,"success_rate":0},{"id":"mmmu__cap_7","uri":"capability://data.processing.analysis.college.level.authentic.sourcing.from.exams.textbooks.and.lectures","name":"college-level authentic sourcing from exams, textbooks, and lectures","description":"Questions are explicitly sourced from authentic college-level materials (exams, textbooks, lectures) rather than synthetic generation or crowdsourcing, ensuring real-world difficulty, relevance, and alignment with actual academic standards. This sourcing approach guarantees that benchmark questions reflect genuine expert-level reasoning requirements rather than artificial or simplified tasks, and reduces risk of benchmark gaming through memorization of synthetic patterns.","intents":["Evaluate my model on authentic expert-level problems that reflect real-world academic and professional reasoning requirements","Ensure my model evaluation is not inflated by synthetic or simplified benchmark questions","Validate that my model can handle the same difficulty level and reasoning complexity as college-educated professionals","Assess whether my model's performance on MMMU correlates with real-world expert task performance"],"best_for":["Organizations evaluating multimodal AI for professional knowledge work (medicine, law, engineering, business)","Research teams studying whether AI performance on synthetic benchmarks correlates with real-world expert task performance","Teams building AI systems for educational applications (tutoring, assessment) where authentic academic content is critical"],"limitations":["Sourcing methodology not documented — unclear which colleges, textbooks, or exams were used","No analysis of temporal data leakage — unknown whether exam/textbook content overlaps with LLM training data","No documentation of content curation or quality control process — unclear whether all sourced content met expert-level standards","Copyright and licensing status of sourced content not documented — unclear whether benchmark respects intellectual property rights","No analysis of whether authentic sourcing introduces systematic biases (e.g., over-representation of certain textbooks or institutions)"],"requires":["Access to source materials (exams, textbooks, lectures) used to create benchmark questions","Verification that questions are authentic and not synthetic derivatives"],"input_types":["college-level exam questions (with images and text)","textbook problems (with diagrams and explanatory text)","lecture materials (with slides, notes, and visual aids)"],"output_types":["benchmark questions (11,500 authentic expert-level problems)","source attribution (reference to original exam, textbook, or lecture)"],"categories":["data-processing-analysis","benchmark-evaluation"],"confidence":0.5,"matches":0,"success_rate":0},{"id":"mmmu__headline","uri":"capability://testing.quality.multimodal.understanding.benchmark.for.ai.models","name":"multimodal understanding benchmark for ai models","description":"The MMMU benchmark is a comprehensive evaluation tool designed to assess the multimodal understanding and reasoning capabilities of AI models across various disciplines, featuring 11,500 expert-level questions that require college-level domain knowledge.","intents":["best multimodal understanding benchmark","benchmark for evaluating AI reasoning skills","top multimodal benchmarks for AI models","multimodal evaluation tools for AI","AI model assessment benchmarks for college-level tasks"],"best_for":["evaluating AI models' reasoning capabilities"],"limitations":[],"requires":[],"input_types":[],"output_types":[],"categories":["testing-quality"],"confidence":0.5,"matches":0,"success_rate":0}],"trust":{"score":61,"verified":false,"data_access_risk":"low","permissions":["Access to Hugging Face datasets (MMMU or MMMU-Pro versions)","Multimodal AI model capable of processing images and text simultaneously","Python environment for local evaluation (specific version requirements unknown)","Either EvalAI account for remote submission or local compute resources for batch evaluation","Image processing capability supporting 30+ heterogeneous visual modality types (charts, diagrams, chemical structures, music sheets, maps, tables)","Access to official MMMU leaderboard with per-discipline and per-subject score breakdowns","Model evaluation on full 11,500-question dataset to generate statistically meaningful per-discipline scores","Discipline and subject metadata annotations for each question in the dataset","Multimodal model with vision encoder capable of processing diverse visual formats (raster images, vector diagrams, structured notation)","Support for domain-specific visual parsing (e.g., chemical structure interpretation, music notation recognition)"],"failure_modes":["Benchmark is static and college-level scoped — does not measure real-time interactive reasoning, multi-turn dialogue, or adversarial robustness","Scoring methodology not explicitly documented in public materials — exact match vs. partial credit scoring formula unknown","Train/dev/test split ratios and data contamination analysis not publicly disclosed, creating uncertainty about overlap with LLM training corpora","No published analysis of demographic biases in question selection or subject representation balance across the 30 subjects","Evaluation requires either remote submission to EvalAI server or local execution environment — no lightweight API-based evaluation option documented","Performance ceiling not yet saturated (GPT-4V at 56% accuracy) but no analysis of whether benchmark has inherent ceiling effects at higher capability levels","Granular per-subject breakdown available on leaderboard but specific scores for all 183 subfields not documented in public materials","No analysis of inter-discipline correlations — unknown whether models that excel in Science also excel in Tech & Engineering","Visual modality breakdown (charts vs. diagrams vs. chemical structures) mentioned but specific per-modality accuracy scores not published in available documentation","No error analysis or failure case categorization — cannot determine whether failures are due to visual perception, domain knowledge gaps, or reasoning errors","builder identity is not verified yet","no observed match outcomes yet"],"rank_breakdown":{"adoption":0.7,"quality":0.8500000000000001,"ecosystem":0.3,"match_graph":0.25,"freshness":0.75,"weights":{"adoption":0.25,"quality":0.35,"ecosystem":0.15,"match_graph":0.2,"freshness":0.05}},"observed_outcomes":{"matches":0,"success_rate":0,"avg_confidence":0,"top_intents":[],"last_matched_at":null},"maintenance":{"status":"active","updated_at":"2026-05-24T12:16:23.328Z","last_scraped_at":null,"last_commit":null},"community":{"stars":null,"forks":null,"weekly_downloads":null,"model_downloads":null,"model_likes":null}},"distribution":{"claim_url":"https://unfragile.ai/submit?claim=mmmu","compare_url":"https://unfragile.ai/compare?artifact=mmmu"}},"signature":"H6vKyJANHRLsHsVLkaHgPfYHvVHXIYBaIjpaLFfiWicKx4p5XVqyoztcTHjfc8TUoci5KJHKaLD1e59uXPd3BQ==","signedAt":"2026-06-22T17:44:01.790Z","signedBy":"unfragile.ai","version":1},"_links":{"self":"https://unfragile.ai/api/v1/passport/mmmu","artifact":"https://unfragile.ai/mmmu","verify":"https://unfragile.ai/api/v1/verify?slug=mmmu","publicKey":"https://unfragile.ai/api/v1/trust-passport-public-key","spec":"https://unfragile.ai/trust","schema":"https://unfragile.ai/schema.json","docs":"https://unfragile.ai/docs"}}