{"passport":{"unfragile":{"@version":"1.0","version":"2026-05","artifact":{"id":"mathvista","slug":"mathvista","name":"MathVista","type":"benchmark","url":"https://mathvista.github.io","page_url":"https://unfragile.ai/mathvista","categories":["testing-quality"],"tags":[],"pricing":{"model":"free","free":true,"starting_price":null},"status":"active","verified":false},"capabilities":[{"id":"mathvista__cap_0","uri":"capability://data.processing.analysis.multimodal.mathematical.reasoning.evaluation.across.visual.domains","name":"multimodal mathematical reasoning evaluation across visual domains","description":"Evaluates multimodal models' ability to interpret visual mathematical representations (geometry diagrams, statistical charts, scientific figures) and perform compositional reasoning combining visual perception with mathematical problem-solving. The benchmark uses a curated dataset of 6,141 examples sourced from 28 existing multimodal datasets plus 3 newly created datasets (IQTest, FunctionQA, PaperQA), with questions presented in multiple-choice and free-form generation formats. 
Scoring uses exact-match accuracy on the testmini subset (1,000 examples) exposed via a public leaderboard.","intents":["Assess whether a multimodal model can correctly interpret complex visual mathematical content and derive accurate answers","Benchmark progress on compositional visual-mathematical reasoning tasks to identify capability gaps in current LMMs","Compare model performance across different mathematical domains (geometry, statistics, scientific figures) to understand domain-specific weaknesses","Establish baseline performance metrics for new multimodal architectures before deployment in mathematical reasoning applications"],"best_for":["AI researchers evaluating multimodal large language models (LMMs) on mathematical reasoning","Teams developing vision-language models targeting STEM education or scientific analysis","Benchmark maintainers tracking progress on compositional visual-mathematical understanding","Organizations assessing whether GPT-4V, Gemini, or open-source LMMs meet mathematical reasoning requirements"],"limitations":["No inter-annotator agreement metrics or annotation quality documentation provided, limiting confidence in ground truth labels","No data contamination analysis against LLM/LMM training corpora, so source datasets or similar content may appear in model training data","Human accuracy of ~60% leaves headroom above current SOTA, but there is no analysis of whether the gap reflects genuine capability limits or annotation ambiguity","Exact task format distribution (multiple-choice vs. 
free-form percentages) unknown, preventing targeted evaluation of specific reasoning types","No statistical significance testing between model comparisons — reported accuracy differences may not be statistically meaningful","Evaluation methodology for GPT-4V was manual via playground chatbot, not standardized API evaluation, introducing potential inconsistency","No breakdown of performance by mathematical domain or visual context type in public documentation, limiting diagnostic capability"],"requires":["Access to multimodal model (GPT-4V, Gemini Ultra, Bard, or open-source LMM with vision capabilities)","Ability to process and display images in multiple formats (JPEG, PNG, PDF figures)","Python 3.7+ for dataset loading and evaluation script execution","Hugging Face account for dataset access (free tier sufficient)","GPU or TPU for efficient batch evaluation of large models (optional but recommended for full benchmark run)"],"input_types":["image (geometry diagrams, statistical charts, scientific figures, IQ test problems, function graphs, research paper figures)","text (multiple-choice options, free-form question prompts)","auxiliary text (OCR-extracted text from images, image captions for text-only model baselines)"],"output_types":["structured data (accuracy scores per model, per domain, per visual context type)","leaderboard rankings (testmini subset performance)","model predictions (text answers for free-form tasks, selected options for multiple-choice)"],"categories":["data-processing-analysis","benchmark-evaluation"],"confidence":0.5,"matches":0,"success_rate":0},{"id":"mathvista__cap_1","uri":"capability://data.processing.analysis.visual.mathematical.dataset.curation.and.annotation","name":"visual mathematical dataset curation and annotation","description":"Aggregates and curates 6,141 mathematical reasoning examples from 28 existing multimodal datasets and 3 newly created datasets (IQTest, FunctionQA, PaperQA) with standardized question-answer pairs. 
The curation process involves selecting examples that require compositional visual-mathematical reasoning, extracting or generating questions, and providing auxiliary annotations (OCR text, image captions) for text-only model baselines. Dataset is hosted on Hugging Face and includes a visualization tool for exploring examples by mathematical domain and visual context type.","intents":["Access a large, curated collection of visual mathematical reasoning examples for training or fine-tuning multimodal models","Analyze distribution of mathematical reasoning types and visual complexity across a diverse dataset","Use auxiliary annotations (OCR, captions) to evaluate text-only models or hybrid approaches on visual mathematical tasks","Explore individual examples via the visualization tool to understand failure modes or benchmark difficulty"],"best_for":["Researchers training or fine-tuning multimodal models on mathematical reasoning tasks","Teams analyzing what visual-mathematical reasoning patterns their models struggle with","Educators or curriculum designers studying how visual representations affect mathematical problem-solving","Benchmark users wanting to understand dataset composition and example characteristics"],"limitations":["Dataset composition bias unknown — no documentation of how many examples come from each of the 28 source datasets, risking overrepresentation of certain visual styles or mathematical domains","No explicit documentation of inter-annotator agreement for newly created datasets (IQTest, FunctionQA, PaperQA), limiting confidence in label quality","Exact train/dev/test split sizes and stratification strategy unknown, preventing reproducible dataset partitioning","No analysis of visual complexity distribution or difficulty calibration across examples","Auxiliary annotations (OCR, captions) quality and coverage not documented — unclear whether all examples have both or only subset","No version control or changelog for dataset updates, risking 
inconsistency if examples are modified post-publication"],"requires":["Hugging Face account (free tier sufficient for dataset download)","Python 3.7+ with datasets library for programmatic access","Image processing library (PIL, OpenCV) to load and manipulate visual examples","Storage capacity for 6,141 images plus metadata (estimated 2-5 GB depending on image resolution)"],"input_types":["image (geometry diagrams, statistical charts, scientific figures, IQ test problems, function graphs, research paper figures)","text (question prompts, multiple-choice options, ground truth answers)"],"output_types":["structured dataset (JSON/JSONL with image paths, questions, answers, metadata)","visualization (web-based tool for browsing examples by domain and visual context)","auxiliary text (OCR-extracted text, image captions for text-only baselines)"],"categories":["data-processing-analysis","memory-knowledge"],"confidence":0.5,"matches":0,"success_rate":0},{"id":"mathvista__cap_10","uri":"capability://tool.use.integration.open.source.dataset.and.code.availability","name":"open-source dataset and code availability","description":"MathVista is released as open-source with dataset available on Hugging Face and code available on GitHub (links provided), enabling researchers to download, analyze, and build upon the benchmark. Open-source release facilitates reproducibility, enables community contributions, and lowers barriers to adoption. 
Researchers can access raw data, evaluation code, and visualization tools without proprietary restrictions.","intents":["Download and access MathVista dataset for research or model development","Reproduce benchmark evaluation and verify published results","Build upon benchmark by creating new models or evaluation approaches","Contribute improvements or extensions to benchmark (if community contributions are accepted)"],"best_for":["Academic researchers wanting to reproduce or build upon benchmark","Open-source model developers wanting to evaluate models on benchmark","Teams wanting to use benchmark without proprietary licensing restrictions","Community members wanting to contribute improvements or extensions"],"limitations":["Open-source release does not guarantee code quality, documentation, or maintenance — code may be research-grade rather than production-ready","No documentation of code license or contribution guidelines — unclear whether community contributions are accepted or how to contribute","No documentation of code repository structure or setup instructions — researchers may need to reverse-engineer how to use code","Open-source release may enable benchmark contamination if models are trained on publicly available dataset — no mechanism to prevent this"],"requires":["Git and GitHub account for accessing code repository","Python 3.7+ for running evaluation code","Hugging Face account (free tier sufficient) for dataset access"],"input_types":["code (evaluation scripts, dataset loading utilities)","dataset (images, questions, answers, metadata)"],"output_types":["downloaded dataset and code for local use","evaluation results (accuracy metrics, leaderboard submissions)"],"categories":["tool-use-integration","automation-workflow"],"confidence":0.5,"matches":0,"success_rate":0},{"id":"mathvista__cap_11","uri":"capability://data.processing.analysis.multi.source.dataset.aggregation.and.standardization","name":"multi-source dataset aggregation and 
standardization","description":"Aggregates examples from 28 existing multimodal datasets plus 3 newly created datasets (IQTest, FunctionQA, PaperQA) into a unified benchmark with standardized question-answer format and consistent evaluation protocol. This aggregation approach combines diverse sources (existing datasets covering various visual-mathematical domains plus new datasets targeting specific reasoning types) into a single coherent benchmark. Standardization enables fair comparison across models and reduces bias from any single source's annotation style or problem distribution.","intents":["Combine diverse sources of visual-mathematical examples into unified benchmark","Reduce bias from any single dataset's annotation style or problem distribution","Enable evaluation on diverse visual-mathematical domains (geometry, statistics, scientific figures) in single benchmark","Create comprehensive benchmark covering multiple visual-mathematical reasoning types"],"best_for":["Researchers wanting comprehensive benchmark covering multiple visual-mathematical domains","Teams wanting to evaluate models on diverse visual-mathematical reasoning types","Benchmark maintainers wanting to reduce bias from single-source datasets","Organizations wanting to assess model performance across broad range of visual-mathematical tasks"],"limitations":["Dataset composition bias unknown — no documentation of how many examples come from each of 28 source datasets, risking overrepresentation of certain visual styles or mathematical domains","Standardization process not documented — unclear how examples from different sources were normalized or whether inconsistencies remain","No analysis of whether aggregated dataset has consistent difficulty distribution or whether some sources are significantly harder/easier than others","No documentation of whether examples from different sources have different annotation quality or agreement levels","Aggregation may introduce biases if source datasets 
have systematic biases that are not addressed during standardization"],"requires":["Access to 28 existing multimodal datasets plus 3 new datasets","Ability to standardize examples from different sources into unified format","Metadata tracking which examples come from which source (for analysis of source bias)"],"input_types":["examples from 28 existing multimodal datasets (format varies by source)","examples from 3 newly created datasets (IQTest, FunctionQA, PaperQA)"],"output_types":["unified benchmark dataset (6,141 examples in standardized format)","metadata tracking example sources and characteristics"],"categories":["data-processing-analysis"],"confidence":0.5,"matches":0,"success_rate":0},{"id":"mathvista__cap_2","uri":"capability://data.processing.analysis.leaderboard.based.model.performance.tracking.and.comparison","name":"leaderboard-based model performance tracking and comparison","description":"Maintains a public leaderboard (testmini subset, 1,000 examples) tracking multimodal model performance on mathematical reasoning tasks with exact-match accuracy as the primary metric. The leaderboard displays rankings of models (GPT-4V at 49.9%, Gemini Ultra, Bard at ~34.8%, and others) and enables comparison of model capabilities across visual mathematical domains. 
Leaderboard is updated as new model submissions are evaluated, providing a living benchmark of progress in multimodal mathematical reasoning.","intents":["Compare performance of different multimodal models (GPT-4V, Gemini, Bard, open-source LMMs) on a standardized mathematical reasoning benchmark","Track progress over time as new models are released and evaluated on the same benchmark","Identify which models are best-suited for visual mathematical reasoning tasks in production applications","Motivate model development teams to improve performance on compositional visual-mathematical reasoning"],"best_for":["AI researchers benchmarking multimodal models against published SOTA","Product teams selecting which LMM to integrate for mathematical reasoning features","Model developers tracking their own model's performance relative to competitors","Organizations evaluating whether GPT-4V or open-source alternatives meet their mathematical reasoning requirements"],"limitations":["Leaderboard submission process and evaluation protocol not documented — unclear how new models are added or evaluated","No statistical significance testing between model rankings — reported accuracy differences may not be statistically meaningful (e.g., 49.9% vs 34.8% gap is large, but smaller gaps may be noise)","Evaluation methodology inconsistent across models — GPT-4V evaluated via manual playground chatbot, while others may use different protocols, introducing potential bias","No confidence intervals or error bars on reported accuracies, limiting ability to assess reliability of rankings","Leaderboard uses only testmini subset (1,000 examples) rather than full dataset, potentially introducing variance in rankings","No breakdown of performance by mathematical domain or visual context type on leaderboard, limiting diagnostic capability","No documentation of whether models were specifically fine-tuned on MathVista or related datasets, risking contamination of rankings"],"requires":["Access to 
multimodal model (API key for GPT-4V, Gemini, Bard, or local inference setup for open-source LMMs)","Ability to submit model predictions to leaderboard (submission mechanism not documented)","Internet access to view leaderboard at https://mathvista.github.io"],"input_types":["model predictions (text answers for free-form tasks, selected options for multiple-choice)","ground truth labels (from testmini subset)"],"output_types":["leaderboard rankings (model name, accuracy score, rank)","performance comparison visualizations (accuracy by model, by domain, by visual context type)"],"categories":["data-processing-analysis","search-retrieval"],"confidence":0.5,"matches":0,"success_rate":0},{"id":"mathvista__cap_3","uri":"capability://data.processing.analysis.auxiliary.text.annotation.for.text.only.model.evaluation","name":"auxiliary text annotation for text-only model evaluation","description":"Provides OCR-extracted text and image captions for each visual example, enabling evaluation of text-only models (e.g., GPT-4 without vision) as baselines on visual mathematical reasoning tasks. This allows researchers to isolate the contribution of visual understanding vs. text-based reasoning by comparing text-only model performance (using OCR + captions) against multimodal model performance (using images). The auxiliary annotations reveal whether models can solve mathematical problems from text descriptions alone or require direct visual interpretation.","intents":["Evaluate text-only models (GPT-4, Claude, Llama) on visual mathematical reasoning by providing OCR and caption text","Quantify the performance gap between text-only and multimodal models to understand the value of visual understanding","Identify which mathematical reasoning types benefit most from visual interpretation vs. 
text description","Establish text-only baselines for comparison with multimodal model performance"],"best_for":["Researchers analyzing the contribution of visual understanding to mathematical reasoning","Teams evaluating whether text-only models with OCR/captions can substitute for multimodal models","Organizations with text-only model infrastructure wanting to benchmark on visual mathematical tasks","Researchers studying how visual representations affect mathematical problem-solving difficulty"],"limitations":["Quality and completeness of OCR text not documented — unclear whether OCR accurately captures all mathematical notation, symbols, and spatial relationships in complex figures","Caption generation methodology not documented — unclear whether captions are human-written, automatically generated, or hybrid, affecting their quality and informativeness","No analysis of how OCR/caption quality affects text-only model performance — unclear whether performance gaps reflect genuine visual understanding requirements or caption/OCR limitations","Auxiliary annotations may not fully capture visual information (e.g., spatial relationships, color coding, diagram structure) that multimodal models can directly perceive","No documentation of whether all examples have both OCR and captions or only subset, limiting consistency of text-only evaluation"],"requires":["Access to auxiliary text annotations (OCR, captions) from MathVista dataset","Text-only model with API access (GPT-4, Claude, Llama, etc.)","Ability to format OCR/caption text as model input (prompt engineering)"],"input_types":["text (OCR-extracted text from images, image captions, question prompts)"],"output_types":["model predictions (text answers for free-form tasks, selected options for multiple-choice)","performance metrics (accuracy of text-only models, comparison with multimodal model 
accuracy)"],"categories":["data-processing-analysis","text-generation-language"],"confidence":0.5,"matches":0,"success_rate":0},{"id":"mathvista__cap_4","uri":"capability://data.processing.analysis.visual.mathematical.domain.specific.performance.analysis","name":"visual mathematical domain-specific performance analysis","description":"Enables analysis of model performance across distinct mathematical domains (geometry, statistics, scientific figures) and visual context types, revealing which reasoning types and visual representations challenge models most. The benchmark structure supports stratified evaluation where accuracy can be computed separately for each domain, allowing researchers to identify capability gaps (e.g., models may excel at statistics but struggle with geometry). Documentation mentions performance varies significantly across mathematical reasoning types and visual context types, though specific breakdowns are not provided in public leaderboard.","intents":["Identify which mathematical domains (geometry, statistics, scientific figures) are most challenging for a given multimodal model","Diagnose whether model failures are due to visual interpretation, mathematical reasoning, or both","Prioritize model improvements by focusing on weakest mathematical domains","Compare models' relative strengths across different visual mathematical reasoning types"],"best_for":["Researchers analyzing multimodal model capabilities across mathematical domains","Teams developing domain-specific applications (e.g., geometry tutoring, statistical analysis) wanting to assess model suitability","Model developers identifying which mathematical reasoning types need improvement","Educators studying how visual representations affect learning in different mathematical domains"],"limitations":["Detailed performance breakdown by mathematical domain and visual context type not provided in public documentation or leaderboard — researchers must compute breakdowns themselves from raw 
predictions","No analysis of whether performance differences across domains are statistically significant or due to random variation","No documentation of example distribution across domains — unclear whether all domains have equal representation or some are overrepresented","No analysis of confounding factors (e.g., whether geometry problems are harder because they require more visual understanding or because they are inherently more complex mathematically)","No guidance on how to stratify evaluation or which domain groupings are most meaningful for analysis"],"requires":["Access to full MathVista dataset with domain labels for each example","Ability to compute accuracy metrics stratified by domain (Python with pandas/numpy recommended)","Model predictions on full dataset (not just testmini leaderboard subset)"],"input_types":["model predictions (text answers for free-form tasks, selected options for multiple-choice)","ground truth labels with domain annotations (geometry, statistics, scientific figures, etc.)","example metadata (visual context type, problem source, difficulty estimate)"],"output_types":["stratified accuracy metrics (accuracy by mathematical domain, by visual context type)","performance comparison visualizations (accuracy breakdown across domains)","diagnostic reports (which domains are most challenging, which models excel at which domains)"],"categories":["data-processing-analysis","planning-reasoning"],"confidence":0.5,"matches":0,"success_rate":0},{"id":"mathvista__cap_5","uri":"capability://search.retrieval.interactive.benchmark.visualization.and.exploration","name":"interactive benchmark visualization and exploration","description":"Provides a web-based visualization tool (🔮 Visualize) accessible at https://mathvista.github.io for exploring individual benchmark examples, filtering by mathematical domain and visual context type, and understanding benchmark composition. 
The tool enables researchers to browse examples, examine model predictions vs. ground truth, and identify patterns in model failures or benchmark difficulty. This interactive exploration complements the leaderboard and dataset documentation by making benchmark content directly inspectable.","intents":["Explore individual benchmark examples to understand what visual-mathematical reasoning tasks look like","Identify patterns in model failures by examining examples where models struggle","Understand benchmark composition and difficulty distribution across domains","Communicate benchmark content to stakeholders or team members via interactive exploration"],"best_for":["Researchers analyzing model failure modes on visual-mathematical reasoning tasks","Teams evaluating whether benchmark is suitable for their use case","Educators or communicators explaining multimodal model capabilities to non-technical audiences","Benchmark users wanting to understand example characteristics before running full evaluation"],"limitations":["Visualization tool capabilities not documented — unclear whether it supports filtering, sorting, searching, or only browsing","No documentation of whether tool displays model predictions or only ground truth labels","Tool may not scale to full 6,141 examples — unclear whether all examples are browsable or only subset","No export functionality documented — unclear whether users can download filtered subsets of examples","Performance and responsiveness of web tool not documented — may be slow for large-scale exploration"],"requires":["Web browser with JavaScript support","Internet access to https://mathvista.github.io","No authentication required (tool appears to be publicly accessible)"],"input_types":["user interactions (filtering by domain, visual context type, searching by keyword, etc.)"],"output_types":["visual display (image, question, ground truth answer, model predictions if available)","metadata (example source, domain, visual context type, 
difficulty estimate if available)"],"categories":["search-retrieval","image-visual"],"confidence":0.5,"matches":0,"success_rate":0},{"id":"mathvista__cap_6","uri":"capability://planning.reasoning.compositional.visual.mathematical.reasoning.evaluation","name":"compositional visual-mathematical reasoning evaluation","description":"Evaluates models' ability to perform compositional reasoning where visual perception and mathematical logic must be jointly applied to solve problems. Unlike benchmarks that test visual understanding (image captioning) or mathematical reasoning (text-only math problems) separately, MathVista requires models to interpret visual representations (diagrams, charts, figures) AND apply mathematical reasoning to derive correct answers. This compositional requirement is enforced through benchmark design where examples cannot be solved from visual content alone or text description alone, but require both modalities.","intents":["Assess whether a multimodal model can jointly apply visual understanding and mathematical reasoning to solve complex problems","Identify whether model failures are due to visual interpretation, mathematical reasoning, or inability to compose these capabilities","Evaluate models' ability to handle fine-grained visual understanding of complex mathematical figures","Test whether models can perform rigorous mathematical reasoning on visually-presented problems"],"best_for":["Researchers studying compositional reasoning in multimodal models","Teams developing applications requiring joint visual-mathematical understanding (e.g., scientific analysis, engineering design)","Model developers improving multimodal reasoning capabilities","Organizations assessing whether models can handle real-world visual-mathematical tasks (e.g., analyzing research papers, interpreting technical diagrams)"],"limitations":["No explicit documentation of how compositional requirement is enforced or validated — unclear whether examples were manually 
verified to require both visual and mathematical reasoning","No analysis of whether models fail due to visual interpretation, mathematical reasoning, or composition — benchmark reports only final accuracy, not intermediate step correctness","No evaluation of whether models can explain their reasoning or only produce final answers — limits understanding of whether models are truly reasoning compositionally or pattern-matching","No adversarial evaluation to test robustness of compositional reasoning — unclear whether models are brittle to visual perturbations or reasoning shortcuts","No analysis of which compositional reasoning patterns are most challenging (e.g., multi-step reasoning, spatial reasoning, quantitative reasoning)"],"requires":["Multimodal model capable of processing images and text jointly","Ability to interpret and reason about visual mathematical representations","Mathematical reasoning capability (arithmetic, geometry, statistics, etc.)"],"input_types":["image (visual mathematical representation: diagram, chart, figure, etc.)","text (question prompt requiring interpretation of visual content and mathematical reasoning)"],"output_types":["text (answer to question, demonstrating joint visual-mathematical reasoning)"],"categories":["planning-reasoning","image-visual"],"confidence":0.5,"matches":0,"success_rate":0},{"id":"mathvista__cap_7","uri":"capability://image.visual.fine.grained.visual.understanding.of.complex.mathematical.figures","name":"fine-grained visual understanding of complex mathematical figures","description":"Tests models' ability to accurately interpret fine-grained details in complex mathematical figures including geometry diagrams with precise spatial relationships, statistical charts with multiple data series and annotations, and scientific figures with technical notation and spatial complexity. 
The benchmark includes examples from research papers and technical documents where visual interpretation requires understanding of mathematical conventions (axis labels, legend symbols, geometric properties, etc.). This capability goes beyond general image understanding to require domain-specific visual literacy in mathematical representations.","intents":["Assess whether a multimodal model can accurately interpret fine-grained details in complex mathematical figures","Evaluate models' understanding of mathematical visual conventions (axis labels, legend symbols, geometric properties, etc.)","Test models' ability to handle spatial relationships and geometric reasoning from visual representations","Identify whether models struggle with specific types of mathematical figures (e.g., 3D diagrams, multi-panel figures, technical notation)"],"best_for":["Researchers studying visual understanding in multimodal models, particularly for technical/scientific content","Teams developing applications requiring interpretation of scientific figures or technical diagrams (e.g., research paper analysis, engineering design review)","Model developers improving visual understanding of mathematical representations","Organizations assessing whether models can reliably interpret figures in scientific or technical documents"],"limitations":["No documentation of visual complexity metrics or difficulty calibration — unclear whether all figures have similar complexity or some are significantly harder","No analysis of which types of figures are most challenging (e.g., 3D diagrams, multi-panel figures, dense technical notation)","No evaluation of whether models can explain what they see in figures or only produce final answers — limits understanding of visual interpretation quality","No adversarial evaluation to test robustness (e.g., figures with slight modifications, rotations, or noise) — unclear whether models are brittle to visual perturbations","No analysis of whether model failures are 
due to visual interpretation or mathematical reasoning on correctly-interpreted visual content","Documentation mentions models 'often struggle to understand complex figures' but provides no detailed analysis of failure modes"],"requires":["Multimodal model with strong visual understanding capabilities","Ability to process images with fine-grained details (high resolution, complex layouts)","Understanding of mathematical visual conventions (axis labels, legend symbols, geometric properties, etc.)"],"input_types":["image (geometry diagrams, statistical charts, scientific figures, research paper figures with fine-grained details)"],"output_types":["text (interpretation of visual content, answers to questions about figures)"],"categories":["image-visual","data-processing-analysis"],"confidence":0.5,"matches":0,"success_rate":0},{"id":"mathvista__cap_8","uri":"capability://data.processing.analysis.human.performance.baseline.and.model.human.comparison","name":"human performance baseline and model-human comparison","description":"Establishes human performance baseline (~60.3% accuracy) on the benchmark, enabling quantification of how far current SOTA models fall short of human-level performance. The 10.4 percentage point gap between GPT-4V (49.9%) and human performance demonstrates that even best-in-class multimodal models struggle with compositional visual-mathematical reasoning. 
This baseline provides a clear target for model improvement and context for interpreting model performance (e.g., whether 49.9% accuracy is near-ceiling or far from human-level).","intents":["Understand how far current SOTA models are from human-level performance on visual-mathematical reasoning","Contextualize model accuracy scores relative to human performance","Identify whether benchmark is saturating (models approaching human performance) or has substantial headroom for improvement","Motivate model development by showing clear gap between SOTA and human performance"],"best_for":["Researchers evaluating progress toward human-level multimodal reasoning","Model developers setting improvement targets relative to human performance","Organizations assessing whether models are ready for production use (e.g., whether 49.9% accuracy is acceptable for their application)","Benchmark maintainers tracking progress over time relative to human baseline"],"limitations":["Human evaluation methodology not documented — unclear how many annotators evaluated examples, what agreement thresholds were used, or how disagreements were resolved","No confidence intervals or error bars on human performance estimate — unclear whether 60.3% is reliable or subject to variance","No analysis of which examples humans find most challenging or easy — unclear whether human errors are due to ambiguous questions, difficult reasoning, or other factors","No documentation of whether human evaluators had access to auxiliary annotations (OCR, captions) or only raw images — affects comparability with model evaluation","No analysis of inter-annotator agreement — unclear whether 60.3% represents consensus or average of disagreeing annotators","Human performance may not represent true ceiling if annotators were not domain experts in mathematics"],"requires":["Human evaluators with mathematical reasoning capability","Ability to display visual mathematical examples to humans and collect responses","Methodology 
for aggregating human responses (majority vote, consensus, etc.)"],"input_types":["image (visual mathematical representation)","text (question prompt)"],"output_types":["human performance metric (accuracy, ~60.3%)","model-human comparison (gap between model and human performance)"],"categories":["data-processing-analysis","planning-reasoning"],"confidence":0.5,"matches":0,"success_rate":0},{"id":"mathvista__cap_9","uri":"capability://data.processing.analysis.iclr.2024.oral.presentation.and.peer.reviewed.validation","name":"iclr 2024 oral presentation and peer-reviewed validation","description":"MathVista was accepted as an oral presentation at ICLR 2024 (85 out of 7,304 submissions, 1.2% acceptance rate), indicating peer-reviewed validation of the benchmark's design, methodology, and significance. The publication includes detailed methodology, results, and analysis reviewed by top-tier conference reviewers. This peer-reviewed validation provides confidence that the benchmark is well-designed and addresses important research questions, distinguishing it from non-peer-reviewed benchmarks or datasets.","intents":["Verify that benchmark design and methodology have been peer-reviewed and validated by experts","Access detailed methodology and analysis published in peer-reviewed venue","Cite benchmark in research papers with confidence that it has been validated by top-tier conference","Understand benchmark's significance and impact in the research community"],"best_for":["Researchers citing benchmark in academic papers and wanting peer-reviewed validation","Teams evaluating benchmark quality and rigor","Organizations assessing benchmark's standing in research community","Benchmark users wanting to understand methodology details published in peer-reviewed paper"],"limitations":["Peer review validates benchmark design but does not guarantee absence of limitations or biases — reviewers may have missed issues","Oral presentation status (1.2% acceptance rate) indicates high 
quality but does not guarantee the benchmark will become a widely adopted standard","Peer review was conducted at the time of publication (ICLR 2024), so issues or limitations discovered after publication are not reflected in the peer review"],"requires":["Access to ICLR 2024 proceedings or arXiv preprint","Ability to read and understand academic paper describing benchmark methodology"],"input_types":["peer review feedback (implicit in published paper)"],"output_types":["validated benchmark design and methodology","published paper with detailed analysis and results"],"categories":["data-processing-analysis"],"confidence":0.5,"matches":0,"success_rate":0},{"id":"mathvista__headline","uri":"capability://testing.quality.visual.mathematical.reasoning.benchmark","name":"visual mathematical reasoning benchmark","description":"MathVista is a benchmark designed to evaluate AI models' ability to interpret and solve mathematical problems represented visually, combining geometry, statistics, and scientific figures.","intents":["best visual math benchmark","benchmark for AI mathematical reasoning","evaluate models on visual math tasks","top benchmarks for visual understanding in math","assess AI performance in visual mathematical contexts"],"best_for":["AI researchers","developers testing visual reasoning"],"limitations":["does not measure non-visual math reasoning"],"requires":[],"input_types":[],"output_types":[],"categories":["testing-quality"],"confidence":0.5,"matches":0,"success_rate":0}],"trust":{"score":62,"verified":false,"data_access_risk":"high","permissions":["Access to multimodal model (GPT-4V, Gemini Ultra, Bard, or open-source LMM with vision capabilities)","Ability to process and display images in multiple formats (JPEG, PNG, PDF figures)","Python 3.7+ for dataset loading and evaluation script execution","Hugging Face account for dataset access (free tier sufficient)","GPU or TPU for efficient batch evaluation of large models (optional but recommended for full 
benchmark run)","Hugging Face account (free tier sufficient for dataset download)","Python 3.7+ with datasets library for programmatic access","Image processing library (PIL, OpenCV) to load and manipulate visual examples","Storage capacity for 6,141 images plus metadata (estimated 2-5 GB depending on image resolution)","Git and GitHub account for accessing code repository"],"failure_modes":["No inter-annotator agreement metrics or annotation quality documentation provided, limiting confidence in ground truth labels","No data contamination analysis against LLM/LMM training corpora — risk that source datasets or similar content appears in model training data","Performance ceiling at ~60% human accuracy suggests benchmark may not saturate current SOTA, but no analysis of whether gap reflects genuine capability limits or annotation ambiguity","Exact task format distribution (multiple-choice vs. free-form percentages) unknown, preventing targeted evaluation of specific reasoning types","No statistical significance testing between model comparisons — reported accuracy differences may not be statistically meaningful","Evaluation methodology for GPT-4V was manual via playground chatbot, not standardized API evaluation, introducing potential inconsistency","No breakdown of performance by mathematical domain or visual context type in public documentation, limiting diagnostic capability","Dataset composition bias unknown — no documentation of how many examples come from each of the 28 source datasets, risking overrepresentation of certain visual styles or mathematical domains","No explicit documentation of inter-annotator agreement for newly created datasets (IQTest, FunctionQA, PaperQA), limiting confidence in label quality","Exact train/dev/test split sizes and stratification strategy unknown, preventing reproducible dataset partitioning","builder identity is not verified yet","no observed match outcomes 
yet"],"rank_breakdown":{"adoption":0.7,"quality":0.9,"ecosystem":0.3,"match_graph":0.25,"freshness":0.75,"weights":{"adoption":0.25,"quality":0.35,"ecosystem":0.15,"match_graph":0.2,"freshness":0.05}},"observed_outcomes":{"matches":0,"success_rate":0,"avg_confidence":0,"top_intents":[],"last_matched_at":null},"maintenance":{"status":"active","updated_at":"2026-05-24T12:16:23.328Z","last_scraped_at":null,"last_commit":null},"community":{"stars":null,"forks":null,"weekly_downloads":null,"model_downloads":null,"model_likes":null}},"distribution":{"claim_url":"https://unfragile.ai/submit?claim=mathvista","compare_url":"https://unfragile.ai/compare?artifact=mathvista"}},"signature":"7O4DqX7VRpjPeLIKAOE4ax/InY9NLmoHGeK6YcXFRXi4sjQX+h+3Ta0tSuY1FMaPXnJ5evNWcaBIzlPAo/ZhAQ==","signedAt":"2026-06-20T10:40:49.252Z","signedBy":"unfragile.ai","version":1},"_links":{"self":"https://unfragile.ai/api/v1/passport/mathvista","artifact":"https://unfragile.ai/mathvista","verify":"https://unfragile.ai/api/v1/verify?slug=mathvista","publicKey":"https://unfragile.ai/api/v1/trust-passport-public-key","spec":"https://unfragile.ai/trust","schema":"https://unfragile.ai/schema.json","docs":"https://unfragile.ai/docs"}}