WMDP
Benchmark · Free. A benchmark for dangerous knowledge in LLMs.
Capabilities (6 decomposed)
multi-domain dangerous knowledge assessment across biosecurity, cybersecurity, and chemical security
Medium confidence: Evaluates LLM outputs against curated benchmark questions spanning three high-risk domains (biosecurity, cybersecurity, chemical security) using domain-expert-validated test cases. The benchmark uses a standardized evaluation framework that scores model responses on their ability to provide actionable dangerous information, enabling quantitative measurement of hazardous capability presence across different model architectures and training approaches.
Explicitly targets three high-consequence security domains (biosecurity, cybersecurity, chemical) with domain-expert-validated questions rather than generic safety benchmarks; uses a proxy measurement approach (dangerous knowledge as proxy for WMD capability) enabling evaluation without requiring actual harmful capability demonstration
More targeted and domain-specific than general safety benchmarks like HELM or TruthfulQA, with explicit focus on actionable dangerous knowledge rather than truthfulness or helpfulness metrics
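A minimal sketch of this capability in practice, assuming the publicly distributed Hugging Face layout of WMDP (`cais/wmdp` with configs such as `wmdp-bio`, `wmdp-cyber`, `wmdp-chem`, a `test` split, and per-item fields `question`, `choices`, `answer`); `ask_model` is a hypothetical callable you supply that maps a prompt to a letter answer.

```python
# Sketch: score a model on WMDP multiple-choice items (assumed HF schema).
from datasets import load_dataset

LETTERS = "ABCD"

def format_prompt(item):
    options = "\n".join(f"{LETTERS[i]}. {c}" for i, c in enumerate(item["choices"]))
    return (
        "Answer the following multiple-choice question with a single letter.\n\n"
        f"{item['question']}\n{options}\nAnswer:"
    )

def wmdp_accuracy(ask_model, config="wmdp-bio", split="test"):
    ds = load_dataset("cais/wmdp", config, split=split)
    correct = 0
    for item in ds:
        prediction = ask_model(format_prompt(item)).strip().upper()[:1]
        if prediction == LETTERS[item["answer"]]:  # gold answer is a choice index
            correct += 1
    return correct / len(ds)
```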
unlearning method evaluation and comparison framework
Medium confidence: Provides standardized evaluation infrastructure for testing unlearning techniques (methods designed to remove dangerous knowledge from trained models) by measuring performance degradation on dangerous tasks while preserving general model capabilities. The framework enables researchers to quantify the trade-off between safety (reducing dangerous knowledge) and utility (maintaining general performance) across different unlearning approaches.
Provides integrated framework for measuring both safety improvements (dangerous knowledge reduction) and utility costs (general capability degradation) simultaneously, enabling quantitative trade-off analysis rather than isolated safety metrics
More comprehensive than single-metric safety evaluations because it explicitly measures the safety-utility trade-off, helping researchers avoid trivial solutions like model lobotomization
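A minimal sketch of the safety-utility trade-off described above: compare a base model and an unlearned model on WMDP (dangerous knowledge, where lower is better and roughly 0.25 is random chance on four-way items) and on a general-capability benchmark such as MMLU (utility, higher is better). It reuses the `wmdp_accuracy` helper sketched earlier; `mmlu_accuracy` is an analogous, hypothetical helper you would provide.

```python
# Sketch: quantify the safety gain vs. utility cost of an unlearning method.
def unlearning_tradeoff(base_model, unlearned_model, mmlu_accuracy):
    report = {}
    for name, model in [("base", base_model), ("unlearned", unlearned_model)]:
        report[name] = {
            "wmdp_bio": wmdp_accuracy(model, "wmdp-bio"),  # lower is safer
            "mmlu": mmlu_accuracy(model),                  # higher is more useful
        }
    report["delta"] = {
        "safety_gain": report["base"]["wmdp_bio"] - report["unlearned"]["wmdp_bio"],
        "utility_cost": report["base"]["mmlu"] - report["unlearned"]["mmlu"],
    }
    return report
```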
domain-specific dangerous knowledge question generation and curation
Medium confidence: Maintains a curated dataset of dangerous knowledge questions across biosecurity, cybersecurity, and chemical security domains, validated by domain experts to ensure questions are realistic, actionable, and representative of actual threat vectors. Questions are structured with metadata (difficulty, specificity, prerequisite knowledge) enabling fine-grained evaluation and analysis of model vulnerabilities across threat categories.
Curated by domain experts in biosecurity, cybersecurity, and chemical security rather than crowdsourced or automatically generated, ensuring questions represent realistic threat vectors and actionable dangerous knowledge
More targeted and threat-realistic than generic adversarial question datasets because questions are validated by domain experts for actual actionability rather than theoretical harm potential
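An illustrative record for the question structure described above. The metadata fields (difficulty, specificity, prerequisites) are hypothetical annotations taken from this listing's description, not from the published WMDP schema, which exposes `question`, `choices`, and `answer`.

```python
# Sketch: a question record with the annotations this capability describes.
from dataclasses import dataclass, field

@dataclass
class WMDPQuestion:
    question: str
    choices: list[str]
    answer: int                       # index of the correct choice
    domain: str                       # "bio" | "cyber" | "chem"
    difficulty: str = "unspecified"   # hypothetical annotation
    specificity: str = "unspecified"  # hypothetical annotation
    prerequisites: list[str] = field(default_factory=list)  # hypothetical annotation
```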
cross-model dangerous knowledge comparison and ranking
Medium confidence: Enables systematic comparison of dangerous knowledge levels across different LLM architectures, training methods, and safety interventions by running the same benchmark questions against multiple models and aggregating results into comparative rankings. Uses standardized scoring to make results comparable across models with different output formats, sizes, and training approaches.
Provides standardized infrastructure for comparing dangerous knowledge across heterogeneous models rather than isolated single-model evaluations, enabling relative safety assessment and ranking
More actionable than individual model safety reports because comparative rankings directly support model selection decisions, whereas isolated metrics require manual interpretation
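A minimal sketch of the cross-model comparison: run the same evaluation over several models and sort by WMDP accuracy, where lower accuracy indicates less retained dangerous knowledge. The model callables and the `wmdp_accuracy` helper are the assumptions sketched earlier in this listing.

```python
# Sketch: rank candidate models by retained dangerous knowledge.
def rank_models(models, config="wmdp-bio"):
    """models: dict mapping a display name to an ask_model callable."""
    scores = {name: wmdp_accuracy(ask, config) for name, ask in models.items()}
    # Ascending accuracy: models answering fewer hazardous items rank first.
    return sorted(scores.items(), key=lambda kv: kv[1])
```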
dangerous knowledge gradient analysis and vulnerability mapping
Medium confidence: Analyzes model responses to dangerous knowledge questions across difficulty levels, specificity dimensions, and prerequisite knowledge requirements to identify vulnerability patterns and gradient structures. Maps which specific knowledge areas, threat vectors, or question characteristics elicit the most dangerous responses, enabling targeted safety interventions and understanding of model knowledge structure.
Maps dangerous knowledge as a multi-dimensional gradient across difficulty, specificity, and prerequisite knowledge rather than treating it as a binary present/absent property, enabling fine-grained vulnerability analysis
More actionable than binary safety pass/fail metrics because gradient analysis identifies specific vulnerability patterns that can be targeted with precision safety interventions
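A minimal sketch of vulnerability mapping: aggregate per-item correctness by a chosen facet, here the WMDP subset (bio/cyber/chem), to show where a model retains the most hazardous knowledge. Finer facets such as difficulty would require the hypothetical annotations discussed above; the prompt helper is the one sketched in the first capability.

```python
# Sketch: per-facet accuracy map over the three WMDP subsets (assumed HF schema).
from collections import defaultdict
from datasets import load_dataset

def vulnerability_map(ask_model, configs=("wmdp-bio", "wmdp-cyber", "wmdp-chem")):
    per_facet = defaultdict(lambda: [0, 0])  # facet -> [correct, total]
    for config in configs:
        for item in load_dataset("cais/wmdp", config, split="test"):
            pred = ask_model(format_prompt(item)).strip().upper()[:1]
            per_facet[config][0] += pred == "ABCD"[item["answer"]]
            per_facet[config][1] += 1
    return {facet: correct / total for facet, (correct, total) in per_facet.items()}
```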
reproducible benchmark execution and result logging
Medium confidence: Provides standardized infrastructure for running WMDP benchmark evaluations with full reproducibility, including deterministic question ordering, response logging, evaluator annotation tracking, and version control for benchmark questions and evaluation criteria. Enables researchers to publish results with full audit trails and enables others to reproduce or extend evaluations.
Provides full reproducibility infrastructure with version control, audit trails, and evaluator tracking rather than just benchmark questions, enabling publication-grade safety evaluations with complete transparency
More rigorous than ad-hoc safety evaluations because full logging and version control enable independent verification and reproduction, supporting scientific standards for safety research
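A minimal sketch of reproducible execution: fix the question order with a seed, log every response to JSONL, and record run metadata (model identifier, dataset config, seed) so the run can be replayed or audited. The log field names are illustrative, not a published WMDP logging schema; the prompt helper is the one sketched earlier.

```python
# Sketch: seeded, fully logged WMDP run for later reproduction.
import json
import random
import time
from datasets import load_dataset

def run_logged_eval(ask_model, model_id, config="wmdp-bio", seed=0,
                    log_path="wmdp_run.jsonl"):
    ds = load_dataset("cais/wmdp", config, split="test")
    order = list(range(len(ds)))
    random.Random(seed).shuffle(order)  # deterministic question order
    with open(log_path, "w") as log:
        header = {"model": model_id, "config": config, "seed": seed,
                  "timestamp": time.time()}
        log.write(json.dumps(header) + "\n")
        for idx in order:
            item = ds[idx]
            response = ask_model(format_prompt(item))
            log.write(json.dumps({"index": idx, "response": response,
                                  "gold": item["answer"]}) + "\n")
```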
Capabilities are decomposed by AI analysis. Each maps to specific user intents and improves with match feedback.
Related Artifacts (sharing capabilities)
Artifacts that share capabilities with WMDP, ranked by overlap. Discovered automatically through the match graph.
Humanity's Last Exam
Hardest exam questions from thousands of experts.
agentseal
Security toolkit for AI agents. Scan your machine for dangerous skills and MCP configs, monitor for supply chain attacks, test prompt injection resistance, and audit live MCP servers for tool poisoning.
MMLU (Massive Multitask Language Understanding)
57-subject benchmark, the standard metric for comparing LLMs.
TruthfulQA
817 adversarial questions measuring model truthfulness vs misconceptions.
GPQA
Graduate-level expert QA — unsearchable questions in biology, physics, chemistry for deep reasoning.
NVIDIA: Llama 3.1 Nemotron 70B Instruct
NVIDIA's Llama 3.1 Nemotron 70B is a language model designed for generating precise and useful responses. Leveraging [Llama 3.1 70B](/models/meta-llama/llama-3.1-70b-instruct) architecture and Reinforcement Learning from Human Feedback (RLHF), it excels...
Best For
- ✓AI safety researchers evaluating model alignment and unlearning techniques
- ✓LLM developers implementing safety-critical deployments in regulated domains
- ✓Red-teamers and security auditors assessing model robustness against misuse
- ✓Policy makers and governance bodies needing quantitative safety metrics
- ✓ML safety researchers developing and testing unlearning algorithms
- ✓Model developers implementing safety-critical fine-tuning pipelines
- ✓Academic teams publishing unlearning research with reproducible baselines
- ✓Organizations evaluating third-party unlearning services
Known Limitations
- ⚠Benchmark questions may not capture all possible dangerous knowledge variants or novel attack vectors
- ⚠Evaluation relies on human judgment for response scoring, introducing potential inconsistency across evaluators
- ⚠Coverage limited to three domains; emerging threat vectors outside biosecurity/cybersecurity/chemical may not be represented
- ⚠Static benchmark may lag behind evolving threat landscape and new dangerous capabilities
- ⚠Unlearning evaluation assumes dangerous knowledge can be cleanly separated from general capabilities, which may not hold in practice
- ⚠Benchmark may not capture all forms of knowledge retention (e.g., implicit knowledge encoded in model weights)
About
Weapons of Mass Destruction Proxy benchmark measuring dangerous knowledge in LLMs across biosecurity, cybersecurity, and chemical security domains, used to evaluate and develop unlearning methods for hazardous capabilities.