Multi Subject Knowledge Evaluation Across 57 Academic Domains

1

MMLU | Benchmark | 63/100

via “few-shot multitask evaluation across 57 knowledge domains”

57-subject knowledge benchmark: 15K+ questions across STEM, humanities, social sciences, and professional domains.

Unique: Organizes 15,908 questions hierarchically across 57 subjects with standardized few-shot prompting (5 examples per subject) and aggregates results at three levels of granularity (subject, category, overall), so a single evaluation run supports both broad coverage assessment and fine-grained domain analysis.

vs others: Broader coverage than domain-specific benchmarks (57 subjects vs 1-5) and more standardized than ad-hoc evaluation, making it the de facto general knowledge benchmark for LLM comparison in research and industry.
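The standardized few-shot prompting described above can be sketched roughly as follows. This is a minimal illustration, not the benchmark's official harness: the record shape `(question, choices, answer_index)`, the function names, and the exact header wording are assumptions made for the example.

```python
from typing import Optional, Sequence, Tuple

CHOICE_LABELS = "ABCD"

# Hypothetical record shape: (question text, four choices, index of the correct choice)
Example = Tuple[str, Sequence[str], int]


def format_question(question: str, choices: Sequence[str],
                    answer: Optional[int] = None) -> str:
    """Render one multiple-choice item; leave the answer blank when it is unknown."""
    lines = [question]
    lines += [f"{label}. {text}" for label, text in zip(CHOICE_LABELS, choices)]
    lines.append("Answer:" if answer is None else f"Answer: {CHOICE_LABELS[answer]}")
    return "\n".join(lines)


def build_few_shot_prompt(subject: str, dev_examples: Sequence[Example],
                          test_question: str, test_choices: Sequence[str],
                          k: int = 5) -> str:
    """Prepend k solved examples from the same subject to the unsolved test item."""
    # Header wording is an assumption for this sketch.
    header = ("The following are multiple choice questions (with answers) "
              f"about {subject.replace('_', ' ')}.")
    shots = [format_question(q, c, a) for q, c, a in dev_examples[:k]]
    return "\n\n".join([header, *shots, format_question(test_question, test_choices)])
```

Using the same k solved examples for every test question in a subject keeps prompts comparable across models, which is the point of standardizing the few-shot setup.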

2

WMDP | Benchmark | 63/100

via “multi-domain dangerous knowledge assessment across biosecurity, cybersecurity, and chemical security”

Benchmark for dangerous knowledge in LLMs.

Unique: Combines expert-validated questions across three distinct security domains (biosecurity, cybersecurity, and chemical security) into a unified benchmark framework, rather than treating each domain separately. Questions are multiple-choice, so scoring is by accuracy against an answer key rather than by rubric or automated classifier.

vs others: More comprehensive than single-domain safety benchmarks (e.g., ToxiGen for toxicity) because it measures dangerous knowledge across multiple hazard categories simultaneously, enabling holistic safety evaluation.

3

MMLU (Massive Multitask Language Understanding) | Benchmark | 61/100

via “subject-specific knowledge profiling”

57-subject benchmark, widely used as a standard for comparing LLMs.

Unique: Covers 57 distinct subjects spanning STEM, humanities, social sciences, and professional domains in a single benchmark, providing comprehensive domain coverage that no single-subject benchmark achieves. Subject taxonomy is derived from real academic curricula and professional certification exams.

vs others: Broader subject coverage than domain-specific benchmarks (e.g., MedQA for medicine only) while maintaining standardization across all subjects, enabling both broad knowledge assessment and targeted domain evaluation in one dataset.
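Subject-specific profiling of this kind comes down to grouping per-question results by subject and rolling them up. The sketch below assumes results arrive as `(subject, is_correct)` pairs; the function name and the subject-to-category mapping are illustrative and are not the benchmark's official tooling or taxonomy file.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def accuracy_report(results: Iterable[Tuple[str, bool]],
                    subject_to_category: Dict[str, str]) -> dict:
    """Aggregate per-question correctness at subject, category, and overall level."""
    by_subject = defaultdict(list)
    for subject, is_correct in results:
        by_subject[subject].append(bool(is_correct))

    subject_acc = {s: sum(v) / len(v) for s, v in by_subject.items()}

    # Pool questions (not subject averages) within each category.
    by_category = defaultdict(list)
    for subject, flags in by_subject.items():
        by_category[subject_to_category[subject]].extend(flags)
    category_acc = {c: sum(v) / len(v) for c, v in by_category.items()}

    all_flags = [f for flags in by_subject.values() for f in flags]
    return {
        "subject": subject_acc,
        "category": category_acc,
        # Micro average: every question counts equally.
        "overall_micro": sum(all_flags) / len(all_flags),
        # Macro average: every subject counts equally, regardless of size.
        "overall_macro": sum(subject_acc.values()) / len(subject_acc),
    }
```

Reporting both micro and macro averages is a design choice for the sketch: subjects differ in size, so the two numbers can diverge, and a published score should state which one it is.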

4

MMMU | Benchmark | 61/100

via “expert-level multimodal reasoning evaluation across 30 college subjects”

Expert-level multimodal understanding across 30 subjects.

Unique: MMMU is the only benchmark combining (1) 11,500 questions across 30 college subjects and 183 subfields, (2) 30 heterogeneous visual modality types (including domain-specific visuals like chemical structures and music sheets), and (3) explicit sourcing from authentic college exams/textbooks/lectures rather than synthetic or crowdsourced data. This scale and diversity of real-world academic content distinguishes it from narrower benchmarks like MMVP or ScienceQA which focus on single domains or simpler visual reasoning.

vs others: MMMU spans six core disciplines and 30 subjects, whereas domain-specific benchmarks (e.g., MedQA for medicine only) cover a single field. It also includes heterogeneous visual types (chemical structures, music sheets) absent from general-purpose multimodal benchmarks like LVLM-eHub, making it one of the most comprehensive tests of expert-level multimodal reasoning across academic domains.

5

ARC (AI2 Reasoning Challenge) | Dataset | 58/100

via “multi-domain science knowledge assessment”

7.8K science questions testing genuine reasoning, not just recall.

Unique: Provides explicit domain labels (physics, chemistry, biology, earth science) for all 7,787 questions, enabling direct per-domain accuracy computation without requiring external domain classification. The Challenge subset maintains domain balance, ensuring that reasoning difficulty is not confounded with domain-specific knowledge gaps.

vs others: More granular than generic science benchmarks that lump all science questions together, and it enables domain-specific debugging that single-domain benchmarks (e.g., physics-only) cannot provide.

6

TriviaQA | Dataset | 58/100

via “world knowledge and domain coverage evaluation”

95K trivia questions requiring cross-document reasoning.

Unique: Curated by trivia enthusiasts across diverse knowledge domains rather than extracted from a single source or task, providing natural distribution of world knowledge questions. The 95,000-question scale enables statistical analysis of performance across domains and identification of knowledge gaps, unlike smaller datasets that may not have sufficient coverage for domain-level evaluation.

vs others: Broader domain coverage than Natural Questions (which focuses on Wikipedia-answerable questions) and more diverse than MS MARCO (web search results), making it better for evaluating general-purpose world knowledge and identifying domain-specific weaknesses in QA systems.

7

MMLU | Benchmark | 49/100

via “multi-domain knowledge assessment”

Massive multitask language understanding across 57 domains.

Unique: MMLU applies one multiple-choice format across all 57 domains, which yields a comprehensive evaluation that is widely accepted in the AI research community, unlike ad-hoc or domain-specific benchmarks.

vs others: MMLU provides a more standardized and comprehensive evaluation across diverse academic fields than benchmarks that focus on narrower domains.

8

chinese-llm-benchmark | Benchmark | 45/100

via “professional domain-specific knowledge evaluation (medical, finance, law, administrative)”

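ReLE evaluation: a continuously updated capability evaluation of Chinese AI large models. It currently covers 374 models, including commercial models such as chatgpt, gpt-5.4, Google gemini-3.1-pro, Claude-4.6, Wenxin ERNIE-X1.1, ERNIE-5.0, qwen3.6-max, qwen3.6-plus, Baichuan, iFlytek Spark, and SenseTime senseChat, as well as open-source models such as step3.5-flash, kimi-k2.6, ernie4.5, MiniMax-M2.7, deepseek-v4, Qwen3.6, llama4, Zhipu GLM-5.1, MiMo-V2, LongCat, gemma4, and mistral. It provides not only a leaderboard but also a large, 2-million-plus…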

Unique: Evaluates four professional domains (Medical, Finance, Law, Administrative) using domain-expert-designed test questions with realistic scenarios (medical case studies, financial analysis, legal document interpretation) rather than generic knowledge questions. Incorporates domain-specific scoring rubrics reflecting professional standards and best practices. Enables cross-domain comparison to identify models suitable for professional applications.

vs others: More specialized domain assessment than general benchmarks (MMLU, C-Eval), with realistic professional scenarios rather than academic knowledge questions.

9

mmlu | Dataset | 24/100

via “academic subject taxonomy and hierarchical filtering”

Dataset by cais. 476,392 downloads.

Unique: Explicit subject labels on every question enable filtering without external knowledge graphs or NLP-based categorization. The 57-subject taxonomy is comprehensive and expert-validated, covering STEM, humanities, social sciences, and professional domains in a single dataset.

vs others: More granular than generic QA datasets (SQuAD, RACE) while keeping the simplicity of a flat taxonomy rather than a complex hierarchical ontology.
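Because each question carries its own subject label, filtering is a plain field comparison. The sketch below works on an in-memory list of dicts; the field names (`question`, `subject`, `choices`, `answer`) and the helper names are assumptions for illustration, so check them against the actual dataset schema before relying on them.

```python
from collections import Counter
from typing import Dict, Iterable, List


def filter_by_subject(records: Iterable[dict], subjects: Iterable[str]) -> List[dict]:
    """Keep only the records whose 'subject' label is in the requested set."""
    wanted = set(subjects)
    return [r for r in records if r["subject"] in wanted]


def subject_counts(records: Iterable[dict]) -> Dict[str, int]:
    """Count questions per subject, e.g. to check coverage before evaluating."""
    return dict(Counter(r["subject"] for r in records))


# Tiny made-up sample; real records would come from the dataset itself.
sample = [
    {"question": "q1", "subject": "astronomy", "choices": ["a", "b", "c", "d"], "answer": 0},
    {"question": "q2", "subject": "philosophy", "choices": ["a", "b", "c", "d"], "answer": 2},
    {"question": "q3", "subject": "astronomy", "choices": ["a", "b", "c", "d"], "answer": 1},
]

astronomy_only = filter_by_subject(sample, ["astronomy"])
```

No classifier or ontology lookup is involved: the flat label is the whole filtering mechanism.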

10

SoBrief | Product

via “cross-domain-knowledge-synthesis”

11

Atlas | Product

via “multi-subject-knowledge-base-access”
