{"passport":{"unfragile":{"@version":"1.0","version":"2026-05","artifact":{"id":"humanity-s-last-exam","slug":"humanity-s-last-exam","name":"Humanity's Last Exam","type":"benchmark","url":"https://lastexam.ai","page_url":"https://unfragile.ai/humanity-s-last-exam","categories":["testing-quality"],"tags":[],"pricing":{"model":"free","free":true,"starting_price":null},"status":"active","verified":false},"capabilities":[{"id":"humanity-s-last-exam__cap_0","uri":"capability://data.processing.analysis.expert.curated.multidisciplinary.exam.question.compilation","name":"expert-curated multidisciplinary exam question compilation","description":"Aggregates 2,500 exam questions sourced from 100+ named contributors across academic disciplines through a collaborative curation process. Questions are vetted through a bug bounty program (closed 03/21/2025) that identified and removed searchable/contaminated items, with replacements integrated into the final dataset. The compilation represents a snapshot of expert consensus on difficult, knowledge-testing problems designed to challenge AI reasoning across domains.","intents":["Evaluate whether my AI model can solve the hardest problems experts consider important across academia","Benchmark my model's reasoning capabilities against a curated standard before claiming superhuman performance","Access a contamination-aware exam dataset that removes known training data leakage risks"],"best_for":["AI researchers evaluating frontier model capabilities","Teams building safety benchmarks for superhuman AI performance claims","Organizations needing disciplinary breadth in evaluation (not domain-specific testing)"],"limitations":["Discipline distribution across 2,500 questions is not publicly documented — unknown if coverage is balanced or skewed toward STEM","Contamination removal scope unclear — number of questions removed via bug bounty and replacement strategy not disclosed","No baseline performance data or SOTA results provided — cannot contextualize difficulty or discriminative power","Task format (multiple choice vs. free-form vs. code execution) not specified in available documentation","Evaluation methodology and scoring function (accuracy, pass@k, F1) not documented"],"requires":["Access to HuggingFace Datasets library (Python 3.7+) to load via `load_dataset('cais/hle')`","Internet connection for dataset download from HuggingFace Hub","Custom evaluation harness — no reference implementation provided","Model API access or local inference capability to generate responses"],"input_types":["exam questions (text-based, format unspecified)","model predictions (format depends on task type, not documented)"],"output_types":["structured dataset (HuggingFace Datasets format)","evaluation scores (metric type unknown)","leaderboard rankings (via HLE-Rolling Live Submission Dashboard)"],"categories":["data-processing-analysis","benchmarking"],"confidence":0.5,"matches":0,"success_rate":0},{"id":"humanity-s-last-exam__cap_1","uri":"capability://automation.workflow.rolling.dynamic.benchmark.contribution.system","name":"rolling dynamic benchmark contribution system","description":"Provides HLE-Rolling, a dynamic fork released 10/08/2025 that accepts ongoing question contributions from the community via email submission to agibenchmark@safe.ai. Contributors can propose new exam questions that are integrated into a living version of the benchmark with update logs. This enables continuous evolution of the benchmark as new domains emerge or expert consensus shifts, while maintaining the original 2,500-question snapshot as a fixed reference point.","intents":["Contribute new exam questions to keep the benchmark current as AI capabilities advance","Access a rolling version of the benchmark that reflects emerging expert consensus","Track how benchmark composition changes over time via update logs"],"best_for":["Academic institutions and research labs contributing domain expertise","Benchmark maintainers seeking community-driven evolution without full revalidation","Teams needing a living benchmark that adapts faster than peer-reviewed snapshots"],"limitations":["Contribution process is email-based (agibenchmark@safe.ai) with no documented review criteria, timeline, or acceptance rate","Update logs for HLE-Rolling are referenced but not provided in accessible documentation","No specification of how rolling contributions are validated or deduplicated against existing questions","Risk of quality degradation or contamination in rolling version compared to peer-reviewed original","No version control or rollback mechanism documented for problematic contributions"],"requires":["Email access to submit contributions to agibenchmark@safe.ai","Exam question in unspecified format (no template or schema provided)","Affiliation with academic or research institution (implied, not stated)"],"input_types":["exam questions (text, format unspecified)","metadata about question source and discipline (format unknown)"],"output_types":["integration into HLE-Rolling dataset","update log entries (format unknown)","leaderboard impact (if question becomes part of evaluation)"],"categories":["automation-workflow","community-contribution"],"confidence":0.5,"matches":0,"success_rate":0},{"id":"humanity-s-last-exam__cap_2","uri":"capability://data.processing.analysis.huggingface.datasets.integration.with.programmatic.access","name":"huggingface datasets integration with programmatic access","description":"Exposes the 2,500-question benchmark via HuggingFace Datasets library under the dataset ID `cais/hle`, enabling one-line programmatic loading via `load_dataset('cais/hle')`. This integration provides standardized data format compatibility with the HuggingFace ecosystem, allowing researchers to load, filter, and evaluate models using standard HF evaluation frameworks without custom data pipelines. The dataset is versioned and hosted on HuggingFace Hub infrastructure.","intents":["Load the benchmark dataset into my Python evaluation pipeline without writing custom data loaders","Use HuggingFace Evaluate framework to score model predictions against HLE questions","Version-control and reproduce benchmark results by pinning the HF dataset version"],"best_for":["Python-based ML researchers using HuggingFace ecosystem tools","Teams already invested in HF Transformers, Datasets, and Evaluate libraries","Researchers needing standardized data formats for reproducible benchmarking"],"limitations":["Requires HuggingFace Datasets library — adds dependency on external package management","No offline mode documented — requires internet connectivity to download from HF Hub on first load","Dataset schema and column names not documented — users must inspect dataset to understand structure","No built-in evaluation metrics provided — HF integration is data-only, not evaluation-complete","Caching behavior and storage requirements not specified"],"requires":["Python 3.7+","HuggingFace Datasets library (pip install datasets)","Internet connection for initial dataset download","HuggingFace Hub account (free) for authentication if dataset is access-restricted"],"input_types":["dataset identifier string ('cais/hle')","optional split specification (train/dev/test, if available)"],"output_types":["HuggingFace Dataset object (Arrow-backed columnar format)","rows accessible as dictionaries with question/answer/metadata fields"],"categories":["data-processing-analysis","tool-use-integration"],"confidence":0.5,"matches":0,"success_rate":0},{"id":"humanity-s-last-exam__cap_3","uri":"capability://automation.workflow.leaderboard.submission.and.ranking.dashboard","name":"leaderboard submission and ranking dashboard","description":"Provides HLE-Rolling Live Submission Dashboard where researchers can submit model predictions and view real-time rankings. The submission process is email-based (agibenchmark@safe.ai) with an unspecified format and evaluation timeline. The dashboard aggregates results across submitted models and displays comparative performance, enabling researchers to benchmark their models against peers and track progress over time. Submission mechanics, evaluation latency, and result publication policy are not documented.","intents":["Submit my model's predictions on HLE questions and see how it ranks against other models","Track leaderboard progress over time to understand if my model improvements are competitive","Discover what models are performing best on the benchmark to identify baselines"],"best_for":["AI research teams competing on a public leaderboard for visibility and validation","Organizations seeking external benchmarking without building internal evaluation infrastructure","Researchers wanting to compare their models against published baselines"],"limitations":["Leaderboard URL not provided in documentation — dashboard location is unclear","Submission format, file structure, and required metadata not specified","Evaluation timeline unknown — no SLA for when results appear on leaderboard","No documentation of evaluation environment (hardware, timeout limits, resource constraints)","Ranking methodology unclear — unknown if scoring is accuracy, pass@k, or custom metric","No baseline results or SOTA numbers provided — cannot contextualize leaderboard positions","Submission frequency limits not documented — unknown if daily/weekly/one-time submissions allowed","No mechanism for disputing results or requesting re-evaluation documented"],"requires":["Email access to submit to agibenchmark@safe.ai","Model predictions in unspecified format","Access to HLE-Rolling Live Submission Dashboard (URL unknown)","Affiliation or institutional context (implied, not stated)"],"input_types":["model predictions (format unspecified)","metadata about model (name, version, organization, etc. — schema unknown)"],"output_types":["leaderboard ranking (position, score, percentile — metrics unknown)","comparative performance visualization (format unknown)","result publication (timing and visibility policy unknown)"],"categories":["automation-workflow","search-retrieval"],"confidence":0.5,"matches":0,"success_rate":0},{"id":"humanity-s-last-exam__cap_4","uri":"capability://safety.moderation.contamination.detection.and.remediation.via.bug.bounty","name":"contamination detection and remediation via bug bounty","description":"Implements a formal bug bounty program (closed 03/21/2025) that incentivizes researchers to identify questions in the benchmark that are searchable in public training data or otherwise contaminated. Identified questions are flagged, removed from the final 2,500-question set, and replaced with new questions. This post-hoc contamination mitigation approach addresses a critical validity threat by explicitly removing known leakage risks before publication, rather than assuming training data isolation.","intents":["Identify and report exam questions that appear in public training data or are otherwise compromised","Ensure the final benchmark is free of known contamination before using it for model evaluation","Contribute to benchmark quality by participating in a structured bug bounty program"],"best_for":["Benchmark maintainers seeking to validate dataset integrity before publication","Researchers concerned about contamination in AI evaluation benchmarks","Organizations building safety-critical benchmarks where contamination invalidates results"],"limitations":["Bug bounty is closed (03/21/2025) — no longer accepting new contamination reports","Scope of 'searchable' questions is undefined — unclear what constitutes contamination (exact match, paraphrase, concept similarity)","Number of questions removed and replaced not disclosed — unknown how many items were flagged","Replacement question provenance not documented — unclear if replacements are equally vetted","No post-remediation validation study provided — unknown if removed questions were truly contaminated or false positives","Remaining contamination risk not quantified — no estimate of undetected leakage in final set","Bug bounty incentive structure not documented — unknown if financial rewards were offered"],"requires":["Access to the benchmark questions (via HuggingFace or GitHub)","Ability to search public training data sources (Common Crawl, GitHub, academic databases)","Evidence of contamination (e.g., exact text match, source attribution)","Email submission to agibenchmark@safe.ai during bug bounty period (now closed)"],"input_types":["exam question text","evidence of contamination (source URL, training data reference, etc.)"],"output_types":["contamination report (accepted or rejected)","removal from benchmark (if validated)","replacement question (if accepted)"],"categories":["safety-moderation","data-processing-analysis"],"confidence":0.5,"matches":0,"success_rate":0},{"id":"humanity-s-last-exam__cap_5","uri":"capability://safety.moderation.nature.published.peer.reviewed.validation","name":"nature-published peer-reviewed validation","description":"The benchmark is published in Nature (Nature 649, 1139–1146, 01/28/2026), providing formal peer review and editorial validation of the benchmark's methodology, validity, and results. This publication signals that the benchmark has undergone rigorous scrutiny by domain experts and meets standards for reproducibility and scientific rigor. The Nature publication establishes the benchmark as a citable reference point for AI evaluation and provides methodological transparency through the peer-reviewed paper.","intents":["Use a peer-reviewed benchmark in my research to ensure credibility and citability","Understand the benchmark's methodology, limitations, and validity through the Nature paper","Reference the benchmark in publications knowing it has undergone formal scientific review"],"best_for":["Academic researchers needing peer-reviewed benchmarks for publication","Teams building safety-critical AI systems requiring validated evaluation standards","Organizations seeking benchmarks with transparent methodology and documented limitations"],"limitations":["Nature paper details (methodology, results, limitations) not included in available documentation — must access the full paper separately","Publication date is 01/28/2026, which is a future date relative to typical document creation — unclear if this is a projection or the document is from 2026","Peer review process details not documented — unknown if reviewers had access to full dataset or only summary results","Nature publication does not guarantee the benchmark remains valid as AI capabilities advance — peer review is a snapshot in time","Access to Nature paper may require institutional subscription or payment"],"requires":["Access to Nature journal (subscription or open access)","Citation: Nature 649, 1139–1146 (01/28/2026)","Ability to read and interpret peer-reviewed scientific paper"],"input_types":["benchmark methodology and results (as described in Nature paper)"],"output_types":["peer-reviewed publication","citable reference (Nature citation format)","methodological documentation and limitations"],"categories":["safety-moderation","planning-reasoning"],"confidence":0.5,"matches":0,"success_rate":0},{"id":"humanity-s-last-exam__cap_6","uri":"capability://data.processing.analysis.multidisciplinary.expert.curation.across.100.contributors","name":"multidisciplinary expert curation across 100+ contributors","description":"Aggregates exam questions from 100+ named contributors spanning diverse academic institutions and disciplines. The curation process involves distributed expertise validation where questions are proposed by domain experts and vetted through the bug bounty and editorial process. This collaborative approach ensures breadth of coverage across disciplines and reduces single-lab bias compared to benchmarks created by a single research team. Contributor affiliations and discipline distribution are documented but not detailed in available materials.","intents":["Evaluate my model on exam questions vetted by domain experts across multiple disciplines","Access a benchmark that represents consensus across diverse academic institutions, not a single lab's perspective","Understand which experts contributed to which questions for transparency and accountability"],"best_for":["Researchers seeking multidisciplinary evaluation beyond single-domain benchmarks","Teams building general-purpose AI systems needing broad coverage","Organizations valuing distributed expertise and reduced single-lab bias"],"limitations":["Discipline distribution across 100+ contributors is not documented — unknown if coverage is balanced (e.g., 50% STEM vs. 50% humanities)","Contributor selection criteria not specified — unknown if selection was open, invited, or biased toward certain institutions","Contribution process for experts not documented — unclear how questions are submitted, reviewed, and integrated","Contributor affiliations listed but not mapped to specific questions — cannot trace which expert created which question","No analysis of inter-rater agreement or consensus strength — unknown if experts agree on question difficulty or correctness","Potential for institutional bias — unknown if certain universities or research labs are overrepresented"],"requires":["Access to contributor list and affiliations (not provided in available documentation)","Understanding of academic disciplines represented (not documented)"],"input_types":["exam questions from domain experts","expert credentials and affiliations"],"output_types":["curated benchmark with expert-validated questions","contributor attribution (if available)"],"categories":["data-processing-analysis","planning-reasoning"],"confidence":0.5,"matches":0,"success_rate":0},{"id":"humanity-s-last-exam__cap_7","uri":"capability://data.processing.analysis.fixed.2500.question.snapshot.for.reproducibility","name":"fixed 2500-question snapshot for reproducibility","description":"Provides a stable, finalized set of 2,500 exam questions (as of 04/03/2025) that serves as the reference benchmark for reproducible evaluation. This fixed snapshot is distinct from the rolling HLE-Rolling version and enables researchers to conduct evaluations that can be exactly reproduced by other teams using the same question set. The snapshot is versioned and published in Nature, establishing it as a canonical reference point for AI evaluation.","intents":["Evaluate my model on a fixed benchmark that won't change, enabling reproducible results","Compare my model's performance to other models evaluated on the exact same 2,500 questions","Cite a specific benchmark version in my research without worrying about future updates"],"best_for":["Researchers conducting rigorous, reproducible benchmarking studies","Teams needing stable baselines for longitudinal model evaluation","Organizations publishing results that must be independently verified"],"limitations":["Fixed set of 2,500 questions may become outdated as AI capabilities advance or new domains emerge","No mechanism to update the snapshot without creating a new version — cannot address discovered errors or biases","Snapshot is static as of 04/03/2025 — unknown if newer versions will be released","Evaluation results on this snapshot may not generalize to future AI systems with different training data or architectures","No documentation of how the snapshot was selected from a larger pool — unknown if selection was random or biased"],"requires":["Access to the fixed 2,500-question dataset (via HuggingFace or GitHub)","Evaluation harness to score model predictions (not provided)"],"input_types":["2,500 exam questions (fixed set)"],"output_types":["model predictions on all 2,500 questions","evaluation scores (metric type unknown)","reproducible results for comparison with other models"],"categories":["data-processing-analysis","planning-reasoning"],"confidence":0.5,"matches":0,"success_rate":0},{"id":"humanity-s-last-exam__headline","uri":"capability://testing.quality.ai.knowledge.and.reasoning.benchmark","name":"ai knowledge and reasoning benchmark","description":"Humanity's Last Exam is a collaborative benchmark that compiles the hardest exam questions from experts across various academic disciplines, designed to test AI's knowledge and reasoning capabilities before achieving superhuman performance.","intents":["best AI benchmark for knowledge assessment","AI reasoning test for academic disciplines","hardest exam questions for AI evaluation","ultimate AI performance test","AI knowledge benchmark for superhuman evaluation"],"best_for":["evaluating AI performance","testing AI reasoning capabilities"],"limitations":[],"requires":[],"input_types":[],"output_types":[],"categories":["testing-quality"],"confidence":0.5,"matches":0,"success_rate":0}],"trust":{"score":61,"verified":false,"data_access_risk":"high","permissions":["Access to HuggingFace Datasets library (Python 3.7+) to load via `load_dataset('cais/hle')`","Internet connection for dataset download from HuggingFace Hub","Custom evaluation harness — no reference implementation provided","Model API access or local inference capability to generate responses","Email access to submit contributions to agibenchmark@safe.ai","Exam question in unspecified format (no template or schema provided)","Affiliation with academic or research institution (implied, not stated)","Python 3.7+","HuggingFace Datasets library (pip install datasets)","Internet connection for initial dataset download"],"failure_modes":["Discipline distribution across 2,500 questions is not publicly documented — unknown if coverage is balanced or skewed toward STEM","Contamination removal scope unclear — number of questions removed via bug bounty and replacement strategy not disclosed","No baseline performance data or SOTA results provided — cannot contextualize difficulty or discriminative power","Task format (multiple choice vs. free-form vs. code execution) not specified in available documentation","Evaluation methodology and scoring function (accuracy, pass@k, F1) not documented","Contribution process is email-based (agibenchmark@safe.ai) with no documented review criteria, timeline, or acceptance rate","Update logs for HLE-Rolling are referenced but not provided in accessible documentation","No specification of how rolling contributions are validated or deduplicated against existing questions","Risk of quality degradation or contamination in rolling version compared to peer-reviewed original","No version control or rollback mechanism documented for problematic contributions","builder identity is not verified yet","no observed match outcomes yet"],"rank_breakdown":{"adoption":0.7,"quality":0.8500000000000001,"ecosystem":0.3,"match_graph":0.25,"freshness":0.75,"weights":{"adoption":0.25,"quality":0.35,"ecosystem":0.15,"match_graph":0.2,"freshness":0.05}},"observed_outcomes":{"matches":0,"success_rate":0,"avg_confidence":0,"top_intents":[],"last_matched_at":null},"maintenance":{"status":"active","updated_at":"2026-05-24T12:16:23.327Z","last_scraped_at":null,"last_commit":null},"community":{"stars":null,"forks":null,"weekly_downloads":null,"model_downloads":null,"model_likes":null}},"distribution":{"claim_url":"https://unfragile.ai/submit?claim=humanity-s-last-exam","compare_url":"https://unfragile.ai/compare?artifact=humanity-s-last-exam"}},"signature":"z6tQyVyHcobuy5JZMU1IAzHkg18WsIRXNCbW4NuLVyk1xB5vOH6dIJvolUQz2xZKf2aJ+jE/JvDgdhKfIIfgCw==","signedAt":"2026-06-20T14:10:19.668Z","signedBy":"unfragile.ai","version":1},"_links":{"self":"https://unfragile.ai/api/v1/passport/humanity-s-last-exam","artifact":"https://unfragile.ai/humanity-s-last-exam","verify":"https://unfragile.ai/api/v1/verify?slug=humanity-s-last-exam","publicKey":"https://unfragile.ai/api/v1/trust-passport-public-key","spec":"https://unfragile.ai/trust","schema":"https://unfragile.ai/schema.json","docs":"https://unfragile.ai/docs"}}