Capability
11 artifacts provide this capability.
via “multi-domain dangerous knowledge assessment across biosecurity, cybersecurity, and chemical security”
Benchmark for dangerous knowledge in LLMs.
Unique: Combines expert-validated questions across three distinct security domains (biosecurity, cybersecurity, chemical) into a unified benchmark framework, rather than treating each domain separately. Uses domain-expert rubrics for scoring rather than automated classifiers, ensuring nuanced assessment of harmful capability presence.
vs others: More comprehensive than single-domain safety benchmarks (e.g., ToxiGen for toxicity) because it measures dangerous knowledge across multiple hazard categories simultaneously, enabling holistic safety evaluation.
via “subject-specific knowledge profiling”
57-subject benchmark, widely used as a standard metric for comparing LLMs.
Unique: Covers 57 distinct subjects spanning STEM, humanities, social sciences, and professional domains in a single benchmark, providing comprehensive domain coverage that no single-subject benchmark achieves. Subject taxonomy is derived from real academic curricula and professional certification exams.
vs others: Broader subject coverage than domain-specific benchmarks (e.g., MedQA for medicine only) while maintaining standardization across all subjects, enabling both broad knowledge assessment and targeted domain evaluation in one dataset.
via “few-shot multitask evaluation across 57 knowledge domains”
57-subject knowledge benchmark — 15K+ questions across STEM, humanities, professional domains.
Unique: Organizes 15,908 questions across 57 subjects with standardized few-shot prompting (5 examples per subject) and aggregates results at multiple granularity levels (subject, category, overall), enabling both broad coverage assessment and fine-grained domain analysis in a single evaluation run.
vs others: Broader coverage than domain-specific benchmarks (57 subjects vs 1-5) and more standardized than ad-hoc evaluation, making it the de facto general knowledge benchmark for LLM comparison in research and industry.
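The multi-level aggregation described above (subject, category, overall) can be sketched as follows. This is a minimal illustration, not the benchmark's official scoring code: the `aggregate` function, the `(subject, is_correct)` record layout, and the subject-to-category mapping are all assumptions made for the example.

```python
from collections import defaultdict


def aggregate(records, subject_to_category):
    """Compute accuracy per subject, per category, and overall.

    records: iterable of (subject, is_correct) pairs (hypothetical layout).
    subject_to_category: dict mapping each subject name to a category name.
    """
    subj_hits = defaultdict(int)
    subj_total = defaultdict(int)
    for subject, is_correct in records:
        subj_total[subject] += 1
        subj_hits[subject] += int(is_correct)

    # Subject level: fraction of correct answers within each subject.
    per_subject = {s: subj_hits[s] / subj_total[s] for s in subj_total}

    # Category level: pool the question counts of all subjects in a category.
    cat_hits = defaultdict(int)
    cat_total = defaultdict(int)
    for s in subj_total:
        c = subject_to_category[s]
        cat_hits[c] += subj_hits[s]
        cat_total[c] += subj_total[s]
    per_category = {c: cat_hits[c] / cat_total[c] for c in cat_total}

    # Overall level: every question counts equally (micro-average).
    total = sum(subj_total.values())
    overall = sum(subj_hits.values()) / total if total else 0.0
    return per_subject, per_category, overall
```

This sketch weights every question equally, so larger subjects influence the category and overall scores more. Averaging the per-subject accuracies instead (a macro-average) is an equally plausible choice; the listing does not say which one the benchmark uses.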
via “expert-level multimodal reasoning evaluation across 30 college subjects”
Expert-level multimodal understanding across 30 subjects.
Unique: MMMU is the only benchmark combining (1) 11,500 questions across 30 college subjects and 183 subfields, (2) 30 heterogeneous visual modality types (including domain-specific visuals like chemical structures and music sheets), and (3) explicit sourcing from authentic college exams/textbooks/lectures rather than synthetic or crowdsourced data. This scale and diversity of real-world academic content distinguishes it from narrower benchmarks like MMVP or ScienceQA which focus on single domains or simpler visual reasoning.
vs others: MMMU spans six disciplines and 30 subjects, whereas domain-specific benchmarks (e.g., MedQA for medicine only) cover a single domain. It also includes heterogeneous visual types (chemical structures, music sheets) absent from general-purpose multimodal benchmarks like LVLM-eHub, making it one of the most comprehensive tests of expert-level multimodal reasoning across academic domains.
via “multi-domain science knowledge assessment”
7.8K science questions testing genuine reasoning, not just recall.
Unique: Provides explicit domain labels (physics, chemistry, biology, earth science) for all 7,787 questions, enabling direct per-domain accuracy computation without requiring external domain classification. The Challenge subset maintains domain balance, ensuring that reasoning difficulty is not confounded with domain-specific knowledge gaps.
vs others: More granular than generic science benchmarks that lump all science questions together; enables domain-specific debugging that single-domain benchmarks (e.g., physics-only) cannot provide.
via “world knowledge and domain coverage evaluation”
95K trivia questions requiring cross-document reasoning.
Unique: Curated by trivia enthusiasts across diverse knowledge domains rather than extracted from a single source or task, providing natural distribution of world knowledge questions. The 95,000-question scale enables statistical analysis of performance across domains and identification of knowledge gaps, unlike smaller datasets that may not have sufficient coverage for domain-level evaluation.
vs others: Broader domain coverage than Natural Questions (which focuses on Wikipedia-answerable questions) and more diverse than MS MARCO (web search results), making it better for evaluating general-purpose world knowledge and identifying domain-specific weaknesses in QA systems.
via “multi-domain knowledge assessment”
Massive multitask language understanding across 57 domains.
Unique: MMLU's structured approach to benchmarking across multiple domains allows for a comprehensive evaluation that is widely accepted in the AI research community, unlike ad-hoc or domain-specific benchmarks.
vs others: MMLU provides a more standardized and comprehensive evaluation across diverse academic fields compared to other benchmarks that may focus on narrower domains.
via “professional domain-specific knowledge evaluation (medical, finance, law, administrative)”
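ReLE evaluation: capability evaluation of Chinese AI large models (continuously updated). Currently covers 374 large models, including commercial models such as chatgpt, gpt-5.4, Google gemini-3.1-pro, Claude-4.6, Wenxin ERNIE-X1.1, ERNIE-5.0, qwen3.6-max, qwen3.6-plus, Baichuan, iFlytek Spark, and SenseTime senseChat, as well as open-source models such as step3.5-flash, kimi-k2.6, ernie4.5, MiniMax-M2.7, deepseek-v4, Qwen3.6, llama4, Zhipu GLM-5.1, MiMo-V2, LongCat, gemma4, and mistral. It provides not only a leaderboard but also a large-scale (over 2 million) …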
Unique: Evaluates four professional domains (Medical, Finance, Law, Administrative) using domain-expert-designed test questions with realistic scenarios (medical case studies, financial analysis, legal document interpretation) rather than generic knowledge questions. Incorporates domain-specific scoring rubrics reflecting professional standards and best practices. Enables cross-domain comparison to identify models suitable for professional applications.
vs others: More specialized domain assessment than general benchmarks (MMLU, C-Eval), with realistic professional scenarios rather than academic knowledge questions.
via “academic subject taxonomy and hierarchical filtering”
Dataset by cais. 476,392 downloads.
Unique: Explicit subject labels for every question enable filtering without external knowledge graphs or NLP-based categorization. The 57-subject taxonomy is comprehensive and expert-validated, covering STEM, humanities, social sciences, and professional domains in a single dataset.
vs others: More granular than generic QA datasets (SQuAD, RACE) while keeping the simplicity of a flat taxonomy rather than a complex hierarchical ontology.
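Because each question carries its own subject label, filtering reduces to a plain membership check. The sketch below assumes rows are dictionaries with a `"subject"` key; that field name, the `filter_by_subject` helper, and the sample rows are illustrative assumptions, not taken from the dataset's documentation.

```python
def filter_by_subject(rows, subjects):
    """Keep only rows whose subject label is in the requested set.

    rows: iterable of dicts with a "subject" key (assumed layout).
    subjects: iterable of subject names to keep.
    """
    wanted = set(subjects)
    return [row for row in rows if row["subject"] in wanted]


# Hypothetical sample rows, for illustration only.
sample_rows = [
    {"subject": "anatomy", "question": "Q1"},
    {"subject": "world_religions", "question": "Q2"},
    {"subject": "anatomy", "question": "Q3"},
]
anatomy_only = filter_by_subject(sample_rows, ["anatomy"])
```

With a flat taxonomy, no tree traversal or classifier is needed: selecting a broader group such as all STEM subjects only requires passing a longer list of subject names.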
via “cross-domain-knowledge-synthesis”
via “multi-subject-knowledge-base-access”
© 2026 Unfragile. The platform for software for agents.