multi-domain dangerous knowledge assessment across biosecurity, cybersecurity, and chemical security, unlearning method evaluation and comparison framework, expert-annotated hazard rubric scoring system, cross-domain dangerous knowledge correlation analysis, benchmark dataset versioning and curation pipeline, model-agnostic inference abstraction for diverse llm architectures, statistical significance testing and confidence interval estimation, red-teaming and adversarial prompt generation for benchmark validation

WMDP

BenchmarkFree

Benchmark for dangerous knowledge in LLMs.

Open Source

/ 100

8 capabilities

Capabilities8 decomposed

multi-domain dangerous knowledge assessment across biosecurity, cybersecurity, and chemical security

Medium confidence

Evaluates LLM outputs against curated question sets spanning three distinct hazard domains (biosecurity, cybersecurity, chemical security) using domain-expert-validated benchmarks. The assessment framework maps model responses to risk levels within each domain, enabling quantitative measurement of dangerous capability presence. Responses are scored against rubrics developed by security domain experts to identify whether models can produce actionable harmful information.

Solves for

measure whether an LLM has learned dangerous knowledge across multiple security domainsidentify which hazard categories a model is most vulnerable totrack dangerous capability regression across model versions and training approachesbenchmark unlearning method effectiveness by comparing pre/post-intervention scores

Best for

AI safety researchers developing unlearning techniques

model developers evaluating safety before deployment

red-teamers assessing LLM vulnerability to misuse

Requires

access to model inference API or local model weights

ability to run model inference at scale (hundreds to thousands of prompts)

Python 3.8+ for benchmark harness

Limitations

benchmark questions may become stale as adversarial techniques evolve; requires periodic expert review and updates

scoring relies on human expert annotation which introduces subjectivity; inter-rater reliability not fully documented

covers only three domains; emerging hazard categories (e.g., autonomous systems, synthetic biology) may not be represented

What makes it unique

Combines expert-validated questions across three distinct security domains (biosecurity, cybersecurity, chemical) into a unified benchmark framework, rather than treating each domain separately. Uses domain-expert rubrics for scoring rather than automated classifiers, ensuring nuanced assessment of harmful capability presence.

vs alternatives

More comprehensive than single-domain safety benchmarks (e.g., ToxiGen for toxicity) because it measures dangerous knowledge across multiple hazard categories simultaneously, enabling holistic safety evaluation.

unlearning method evaluation and comparison framework

Medium confidence

Provides standardized evaluation infrastructure to measure the effectiveness of unlearning techniques (methods that remove dangerous capabilities from trained models) by comparing model performance before and after unlearning interventions. The framework isolates the impact of unlearning by holding the benchmark constant while varying the model state, enabling quantitative assessment of whether dangerous knowledge has been successfully suppressed.

Solves for

measure whether an unlearning technique successfully removes dangerous capabilitiescompare effectiveness across different unlearning methods (e.g., gradient ascent vs. data deletion)detect capability degradation or side effects from unlearning (e.g., loss of benign knowledge)establish baseline metrics for publishing unlearning research

Best for

researchers developing new unlearning algorithms

safety teams validating unlearning before model release

organizations comparing commercial unlearning services

Requires

two versions of the same model (pre-unlearning and post-unlearning)

ability to run inference on both model versions

computational budget for full benchmark evaluation (typically 1-4 hours per model on GPU)

Limitations

benchmark measures only what is explicitly tested; unlearning may fail on out-of-distribution dangerous queries not in the benchmark

does not measure capability recovery through fine-tuning or prompt engineering; adversaries may circumvent unlearning

requires access to model weights or fine-tuning APIs; cannot evaluate proprietary models without vendor cooperation

What makes it unique

Provides a standardized evaluation harness specifically designed for unlearning research, with built-in comparison logic and side-effect detection. Unlike generic benchmarks, it explicitly measures delta between model states and flags unintended capability loss.

vs alternatives

More rigorous than ad-hoc unlearning evaluation because it enforces consistent benchmark administration, statistical testing, and side-effect measurement across all methods being compared.

expert-annotated hazard rubric scoring system

Medium confidence

Implements a structured scoring framework where model responses to dangerous knowledge questions are evaluated against expert-developed rubrics that assess the degree of hazard (e.g., specificity, actionability, completeness of harmful information). Responses are scored on multi-point scales (typically 0-4 or 0-5) rather than binary pass/fail, capturing nuance in how dangerous a model's output actually is. Rubrics are domain-specific (biosecurity, cybersecurity, chemical) and developed by subject matter experts to ensure validity.

Solves for

quantify the severity of dangerous knowledge in model outputs, not just presence/absenceenable fine-grained comparison of models with similar overall safety profilesidentify which types of dangerous information a model is most prone to generatingprovide interpretable scores for stakeholders (e.g., 'model scores 2.3/5 on biosecurity, indicating moderate hazard')

Best for

safety researchers needing nuanced capability measurement

model developers comparing safety across versions

auditors assessing model risk for deployment decisions

Requires

access to model outputs (text responses to benchmark questions)

trained annotators or automated scoring system calibrated to rubrics

rubric documentation (typically 5-20 pages per domain with examples)

Limitations

rubric development requires domain expertise; creating new rubrics for emerging hazards is time-consuming

inter-rater reliability depends on rubric clarity and annotator training; disagreement rates not always reported

rubrics may encode subjective judgments about what constitutes 'actionable' or 'specific' harmful information

What makes it unique

Uses domain-expert-developed multi-point rubrics rather than automated classifiers or binary labels, enabling nuanced assessment of dangerous knowledge severity. Rubrics are calibrated to distinguish between vague, incomplete, and highly actionable harmful information.

vs alternatives

More interpretable and defensible than black-box classifiers because rubric criteria are explicit and expert-validated; enables stakeholders to understand why a response received a particular score.

cross-domain dangerous knowledge correlation analysis

Medium confidence

Analyzes patterns in how dangerous knowledge correlates across the three benchmark domains (biosecurity, cybersecurity, chemical security), identifying whether models that excel at suppressing one type of hazard tend to suppress others. The analysis uses statistical correlation and clustering techniques to reveal whether dangerous capabilities are independent or coupled in model behavior. This enables understanding of whether unlearning interventions have domain-specific or global effects.

Solves for

understand whether dangerous knowledge in different domains stems from shared model mechanismspredict whether unlearning in one domain will affect capabilities in other domainsidentify models with domain-specific vulnerabilities (e.g., strong on biosecurity but weak on cybersecurity)design targeted unlearning interventions that address specific domain weaknesses

Best for

safety researchers studying the structure of dangerous capabilities in LLMs

model developers optimizing safety across multiple hazard categories

organizations with domain-specific risk profiles (e.g., biotech companies prioritizing biosecurity)

Requires

scores from at least 10-20 models across all three domains

Python 3.8+ with scipy/numpy for statistical analysis

domain expertise to interpret correlations meaningfully

Limitations

correlation analysis requires evaluation of many models; computationally expensive

correlations may be spurious if domains are confounded (e.g., both require technical knowledge)

small sample size of models in early benchmark versions limits statistical power

What makes it unique

Explicitly analyzes relationships between dangerous knowledge across domains rather than treating each domain independently. Enables discovery of whether hazards are coupled or independent in model behavior.

vs alternatives

Provides deeper insight than single-domain benchmarks by revealing how safety properties interact across different hazard categories, informing more effective unlearning strategies.

benchmark dataset versioning and curation pipeline

Medium confidence

Manages the creation, validation, and versioning of benchmark questions and rubrics through a structured curation pipeline involving domain experts, adversarial testing, and iterative refinement. The pipeline ensures questions are sufficiently difficult to elicit dangerous knowledge without being unrealistic, and rubrics are calibrated through inter-rater agreement studies. Version control enables tracking of benchmark evolution and ensures reproducibility across research papers.

Solves for

maintain a high-quality, expert-validated benchmark that doesn't become staleenable reproducible research by providing versioned benchmark datasetsincorporate new hazard categories and attack vectors as threats evolvedocument benchmark design decisions and limitations for transparency

Best for

benchmark maintainers ensuring quality and freshness

researchers needing reproducible benchmark versions for publications

organizations tracking benchmark evolution over time

Requires

domain expert panel (typically 3-5 experts per domain)

version control system (Git or similar)

inter-rater agreement validation infrastructure

Limitations

curation is labor-intensive; requires ongoing expert involvement

versioning can fragment research community if different papers use different benchmark versions

adversarial testing may not catch all ways models can produce dangerous outputs

What makes it unique

Implements a formal curation pipeline with expert validation and inter-rater agreement checks, rather than ad-hoc question collection. Versioning enables reproducible research and transparent tracking of benchmark evolution.

vs alternatives

More rigorous than informal benchmarks because it enforces expert review, inter-rater validation, and version control, reducing bias and enabling reproducible comparisons across papers.

model-agnostic inference abstraction for diverse llm architectures

Medium confidence

Provides a unified interface for evaluating diverse LLM architectures (open-source models, API-based models, fine-tuned variants) by abstracting away implementation differences. The abstraction handles API calls (OpenAI, Anthropic, etc.), local inference (Hugging Face, Ollama), and custom model serving, enabling consistent benchmark administration across heterogeneous model types. This enables fair comparison between models with different deployment modalities.

Solves for

evaluate both open-source and proprietary models using the same benchmarkcompare models deployed via APIs vs. locally without reimplementing benchmark logicsupport emerging model architectures without modifying benchmark codeenable researchers to test their own fine-tuned models alongside baselines

Best for

researchers comparing diverse model families (open vs. closed, different scales)

organizations evaluating both internal and external models

benchmark maintainers supporting multiple model deployment options

Requires

Python 3.8+ with model-specific SDKs (transformers, openai, anthropic, etc.)

API keys for proprietary models (OpenAI, Anthropic, etc.)

GPU or TPU for local inference (optional but recommended)

Limitations

abstraction adds latency (typically 50-200ms per request) due to interface overhead

API-based models may have rate limits or cost implications for large-scale evaluation

local inference requires sufficient computational resources; not all models fit on consumer hardware

What makes it unique

Abstracts away differences between API-based, local, and custom-deployed models through a unified interface, enabling fair comparison without reimplementing benchmark logic for each model type.

vs alternatives

More flexible than model-specific benchmarks because it supports any LLM architecture without code changes, reducing friction for researchers evaluating new models.

statistical significance testing and confidence interval estimation

Medium confidence

Implements rigorous statistical testing to determine whether differences in dangerous knowledge scores between models or unlearning methods are statistically significant or due to random variation. Uses techniques like bootstrap confidence intervals, permutation tests, and effect size estimation to quantify uncertainty in benchmark results. This prevents overconfident claims about safety improvements that may not be robust.

Solves for

determine whether an unlearning method genuinely reduces dangerous knowledge or just got luckycompare two models and quantify confidence in the comparisonestimate how much dangerous knowledge remains with confidence boundsdetect when benchmark sample size is insufficient for reliable conclusions

Best for

researchers publishing safety claims that require statistical rigor

organizations making deployment decisions based on safety metrics

peer reviewers validating safety research

Requires

Python 3.8+ with scipy.stats for statistical tests

at least 20-30 benchmark questions per domain for reliable estimation

understanding of statistical concepts (p-values, confidence intervals, effect sizes)

Limitations

statistical testing assumes benchmark questions are representative; biased question set invalidates tests

multiple comparisons (e.g., comparing many models) require correction (Bonferroni, FDR) which reduces power

small sample sizes (few models or few questions) limit statistical power; may fail to detect real differences

What makes it unique

Integrates formal statistical testing into the benchmark evaluation pipeline rather than relying on point estimates, ensuring claims about safety improvements are statistically justified.

vs alternatives

More rigorous than informal comparisons because it quantifies uncertainty and prevents overconfident claims about safety improvements that may not be robust to sampling variation.

red-teaming and adversarial prompt generation for benchmark validation

Medium confidence

Employs adversarial testing techniques to validate that benchmark questions reliably elicit dangerous knowledge and cannot be easily circumvented by prompt engineering. Red-teamers attempt to find questions that fail to elicit dangerous knowledge or rubric edge cases, and the benchmark is iteratively refined based on findings. This ensures the benchmark is robust to adversarial adaptation and captures genuine dangerous capabilities rather than surface-level patterns.

Solves for

identify benchmark questions that are too easy or too hard to elicit dangerous knowledgediscover prompt engineering techniques that allow models to evade benchmark questionsvalidate that rubric scores are robust to paraphrasing or obfuscationensure benchmark remains effective as adversaries develop new attack techniques

Best for

benchmark maintainers ensuring robustness to adversarial adaptation

safety researchers studying prompt engineering defenses

organizations validating that safety metrics cannot be gamed

Requires

red-teaming team (typically 2-5 skilled adversarial testers)

access to models being tested

documentation of red-teaming methodology and findings

Limitations

red-teaming is labor-intensive; requires skilled adversarial testers

adversarial testing may not discover all possible evasion techniques

benchmark updates based on red-teaming may introduce new biases

What makes it unique

Incorporates formal red-teaming into the benchmark validation pipeline rather than assuming questions are robust, ensuring the benchmark remains effective against adversarial adaptation.

vs alternatives

More robust than static benchmarks because it actively searches for evasion techniques and iteratively refines questions, reducing the risk that models can circumvent the benchmark through prompt engineering.

Capabilities are decomposed by AI analysis. Each maps to specific user intents and improves with match feedback.

Related Artifactssharing capabilities

Artifacts that share capabilities with WMDP, ranked by overlap. Discovered automatically through the match graph.

Dataset59

WildGuard

Allen AI's safety classification dataset and model.

evaluation benchmark for safety classifier performanceharm category taxonomy and annotation schemacurated adversarial prompt dataset with human annotationsmulti-class prompt harmfulness classification

4 shared capabilities

Benchmark62

Humanity's Last Exam

Hardest exam questions from thousands of experts.

expert-curated multidisciplinary exam question compilationmultidisciplinary expert curation across 100+ contributors

2 shared capabilities

Benchmark63

SafetyBench Eval

11K safety evaluation questions across 7 categories.

seven-category safety taxonomy and question curationmulti-category llm safety evaluation via multiple-choice questions

2 shared capabilities

Benchmark63

SafetyBench

11K safety evaluation questions across 7 categories.

7-category safety taxonomy with fine-grained failure mode classificationmultilingual safety evaluation dataset with category-stratified sampling

2 shared capabilities

Model22

OpenAI: gpt-oss-safeguard-20b

gpt-oss-safeguard-20b is a safety reasoning model from OpenAI built upon gpt-oss-20b. This open-weight, 21B-parameter Mixture-of-Experts (MoE) model offers lower latency for safety tasks like content classification, LLM filtering, and trust...

multi-label safety classification with confidence scoring

1 shared capability

Agent58

ChemCrow

AI agent with chemistry tools for synthesis planning.

chemical safety assessment and hazard prediction

1 shared capability

Best For

✓AI safety researchers developing unlearning techniques
✓model developers evaluating safety before deployment
✓red-teamers assessing LLM vulnerability to misuse
✓policy makers needing quantitative dangerous capability metrics
✓researchers developing new unlearning algorithms
✓safety teams validating unlearning before model release
✓organizations comparing commercial unlearning services
✓academic groups publishing safety research

Known Limitations

⚠benchmark questions may become stale as adversarial techniques evolve; requires periodic expert review and updates
⚠scoring relies on human expert annotation which introduces subjectivity; inter-rater reliability not fully documented
⚠covers only three domains; emerging hazard categories (e.g., autonomous systems, synthetic biology) may not be represented
⚠static question set may not capture all ways a model could produce dangerous outputs; adversaries may find novel phrasings
⚠benchmark measures only what is explicitly tested; unlearning may fail on out-of-distribution dangerous queries not in the benchmark
⚠does not measure capability recovery through fine-tuning or prompt engineering; adversaries may circumvent unlearning

Requirements

access to model inference API or local model weightsability to run model inference at scale (hundreds to thousands of prompts)Python 3.8+ for benchmark harnesscomputational resources for batch evaluation (GPU recommended for large models)two versions of the same model (pre-unlearning and post-unlearning)ability to run inference on both model versionscomputational budget for full benchmark evaluation (typically 1-4 hours per model on GPU)Python 3.8+ and PyTorch or TensorFlow for model manipulation

Input / Output

Accepts: text prompts (natural language questions designed to elicit dangerous knowledge), model identifiers (to specify which LLM to evaluate), model checkpoint (pre-unlearning state), unlearning intervention (code, weights, or configuration), model checkpoint (post-unlearning state), model response text (free-form LLM output), question context (the prompt that elicited the response), domain identifier (biosecurity, cybersecurity, or chemical), per-model, per-domain aggregate scores (e.g., biosecurity score, cybersecurity score, chemical score for each model), candidate benchmark questions (text), proposed rubric updates (rubric documents), model responses for inter-rater validation (text), model identifier (e.g., 'gpt-4', 'meta-llama/Llama-2-7b', 'claude-3-opus'), prompt text (benchmark question), model configuration (temperature, max_tokens, etc.), per-question scores (numeric values for each benchmark question), model identifiers or intervention labels (for grouping comparisons), benchmark questions (to be tested for robustness), model responses (to identify evasion patterns), rubric definitions (to find edge cases)

Produces: structured JSON with per-domain scores (0-100 scale), per-question response transcripts with risk annotations, aggregate statistics (mean, std dev, percentile distributions across domains), delta scores (difference in dangerous knowledge before/after unlearning), per-domain effectiveness metrics, side-effect analysis (benign capability retention scores), statistical significance tests comparing pre/post distributions, numeric score (e.g., 0-4 scale per response), score distribution (histogram of scores across all responses in a domain), aggregate metrics (mean, median, 95th percentile scores), annotator confidence or uncertainty estimates, correlation matrix (3x3 showing pairwise domain correlations), clustering visualization (grouping models by domain vulnerability profiles), statistical tests (Pearson/Spearman correlation with p-values), domain-specific vulnerability profiles (e.g., 'high biosecurity risk, low cybersecurity risk'), versioned benchmark dataset (JSON with questions, rubrics, metadata), inter-rater agreement reports (Cohen's kappa, Fleiss' kappa), changelog documenting updates between versions, benchmark documentation (design rationale, limitations, usage guide), model response text, metadata (latency, token count, API cost if applicable), error handling (graceful fallback for rate limits or API failures), p-values (significance tests), confidence intervals (e.g., 95% CI for mean score), effect sizes (Cohen's d, Hedges' g), power analysis (sample size needed for desired statistical power), red-teaming report (findings, evasion techniques discovered, recommendations), updated benchmark questions (refined based on red-teaming findings), rubric clarifications (addressing edge cases found during testing), adversarial prompt examples (demonstrating evasion techniques)

UnfragileRank

Adoption70%(25% weight)

Quality85%(35% weight)

Ecosystem40%(15% weight)

Match Graph25%(20% weight)

Freshness100%(5% weight)

UnfragileRank is computed from adoption signals, documentation quality, ecosystem connectivity, match graph feedback, and freshness. No artifact can pay for a higher rank.

Type: Benchmark

8 capabilities

Visit WMDP→

About

Weapons of Mass Destruction Proxy benchmark measuring dangerous knowledge in LLMs across biosecurity, cybersecurity, and chemical security domains, used to evaluate and develop unlearning methods for hazardous capabilities.

Alternatives to WMDP

v087Product

AI UI generator by Vercel — creates production-quality React/Next.js components from natural language descriptions.

Compare →

Framer82Product

AI-powered website design and publishing — generates responsive, professionally designed sites from descriptions.

Compare →

Midjourney79Product

AI image generation — artistic high-quality outputs, Discord bot, photorealistic V6 model.

Compare →

xCodeEval67Benchmark

Multilingual code evaluation across 17 languages.

Compare →

Are you the builder of WMDP?

Claim this artifact to get a verified badge, access match analytics, see which intents users search for, and manage your listing.

Claim this artifact →Verification via email

Get the weekly brief

New tools, rising stars, and what's actually worth your time. No spam.

Data Sources

seed developer essentials

Looking for something else?

Search →

Capabilities8 decomposed

multi-domain dangerous knowledge assessment across biosecurity, cybersecurity, and chemical security

Medium confidence

Solves for

Best for

AI safety researchers developing unlearning techniques

model developers evaluating safety before deployment

red-teamers assessing LLM vulnerability to misuse

Requires

access to model inference API or local model weights

ability to run model inference at scale (hundreds to thousands of prompts)

Python 3.8+ for benchmark harness

Limitations

benchmark questions may become stale as adversarial techniques evolve; requires periodic expert review and updates

scoring relies on human expert annotation which introduces subjectivity; inter-rater reliability not fully documented

covers only three domains; emerging hazard categories (e.g., autonomous systems, synthetic biology) may not be represented

What makes it unique

vs alternatives

unlearning method evaluation and comparison framework

Medium confidence

Solves for

Best for

researchers developing new unlearning algorithms

safety teams validating unlearning before model release

organizations comparing commercial unlearning services

Requires

two versions of the same model (pre-unlearning and post-unlearning)

ability to run inference on both model versions

computational budget for full benchmark evaluation (typically 1-4 hours per model on GPU)

Limitations

benchmark measures only what is explicitly tested; unlearning may fail on out-of-distribution dangerous queries not in the benchmark

does not measure capability recovery through fine-tuning or prompt engineering; adversaries may circumvent unlearning

requires access to model weights or fine-tuning APIs; cannot evaluate proprietary models without vendor cooperation

What makes it unique

vs alternatives

More rigorous than ad-hoc unlearning evaluation because it enforces consistent benchmark administration, statistical testing, and side-effect measurement across all methods being compared.

expert-annotated hazard rubric scoring system

Medium confidence

Solves for

Best for

safety researchers needing nuanced capability measurement

model developers comparing safety across versions

auditors assessing model risk for deployment decisions

Requires

access to model outputs (text responses to benchmark questions)

trained annotators or automated scoring system calibrated to rubrics

rubric documentation (typically 5-20 pages per domain with examples)

Limitations

rubric development requires domain expertise; creating new rubrics for emerging hazards is time-consuming

inter-rater reliability depends on rubric clarity and annotator training; disagreement rates not always reported

rubrics may encode subjective judgments about what constitutes 'actionable' or 'specific' harmful information

What makes it unique

vs alternatives

More interpretable and defensible than black-box classifiers because rubric criteria are explicit and expert-validated; enables stakeholders to understand why a response received a particular score.

cross-domain dangerous knowledge correlation analysis

Medium confidence

Solves for

Best for

safety researchers studying the structure of dangerous capabilities in LLMs

model developers optimizing safety across multiple hazard categories

organizations with domain-specific risk profiles (e.g., biotech companies prioritizing biosecurity)

Requires

scores from at least 10-20 models across all three domains

Python 3.8+ with scipy/numpy for statistical analysis

domain expertise to interpret correlations meaningfully

Limitations

correlation analysis requires evaluation of many models; computationally expensive

correlations may be spurious if domains are confounded (e.g., both require technical knowledge)

small sample size of models in early benchmark versions limits statistical power

What makes it unique

vs alternatives

Provides deeper insight than single-domain benchmarks by revealing how safety properties interact across different hazard categories, informing more effective unlearning strategies.

benchmark dataset versioning and curation pipeline

Medium confidence

Solves for

Best for

benchmark maintainers ensuring quality and freshness

researchers needing reproducible benchmark versions for publications

organizations tracking benchmark evolution over time

Requires

domain expert panel (typically 3-5 experts per domain)

version control system (Git or similar)

inter-rater agreement validation infrastructure

Limitations

curation is labor-intensive; requires ongoing expert involvement

versioning can fragment research community if different papers use different benchmark versions

adversarial testing may not catch all ways models can produce dangerous outputs

What makes it unique

vs alternatives

More rigorous than informal benchmarks because it enforces expert review, inter-rater validation, and version control, reducing bias and enabling reproducible comparisons across papers.

model-agnostic inference abstraction for diverse llm architectures

Medium confidence

Solves for

Best for

researchers comparing diverse model families (open vs. closed, different scales)

organizations evaluating both internal and external models

benchmark maintainers supporting multiple model deployment options

Requires

Python 3.8+ with model-specific SDKs (transformers, openai, anthropic, etc.)

API keys for proprietary models (OpenAI, Anthropic, etc.)

GPU or TPU for local inference (optional but recommended)

Limitations

abstraction adds latency (typically 50-200ms per request) due to interface overhead

API-based models may have rate limits or cost implications for large-scale evaluation

local inference requires sufficient computational resources; not all models fit on consumer hardware

What makes it unique

Abstracts away differences between API-based, local, and custom-deployed models through a unified interface, enabling fair comparison without reimplementing benchmark logic for each model type.

vs alternatives

More flexible than model-specific benchmarks because it supports any LLM architecture without code changes, reducing friction for researchers evaluating new models.

statistical significance testing and confidence interval estimation

Medium confidence

Solves for

Best for

researchers publishing safety claims that require statistical rigor

organizations making deployment decisions based on safety metrics

peer reviewers validating safety research

Requires

Python 3.8+ with scipy.stats for statistical tests

at least 20-30 benchmark questions per domain for reliable estimation

understanding of statistical concepts (p-values, confidence intervals, effect sizes)

Limitations

statistical testing assumes benchmark questions are representative; biased question set invalidates tests

multiple comparisons (e.g., comparing many models) require correction (Bonferroni, FDR) which reduces power

small sample sizes (few models or few questions) limit statistical power; may fail to detect real differences

What makes it unique

Integrates formal statistical testing into the benchmark evaluation pipeline rather than relying on point estimates, ensuring claims about safety improvements are statistically justified.

vs alternatives

More rigorous than informal comparisons because it quantifies uncertainty and prevents overconfident claims about safety improvements that may not be robust to sampling variation.

red-teaming and adversarial prompt generation for benchmark validation

Medium confidence

Solves for

Best for

benchmark maintainers ensuring robustness to adversarial adaptation

safety researchers studying prompt engineering defenses

organizations validating that safety metrics cannot be gamed

Requires

red-teaming team (typically 2-5 skilled adversarial testers)

access to models being tested

documentation of red-teaming methodology and findings

Limitations

red-teaming is labor-intensive; requires skilled adversarial testers

adversarial testing may not discover all possible evasion techniques

benchmark updates based on red-teaming may introduce new biases

What makes it unique

Incorporates formal red-teaming into the benchmark validation pipeline rather than assuming questions are robust, ensuring the benchmark remains effective against adversarial adaptation.

vs alternatives

Capabilities are decomposed by AI analysis. Each maps to specific user intents and improves with match feedback.

Alternatives to WMDP

v087Product

AI UI generator by Vercel — creates production-quality React/Next.js components from natural language descriptions.

Compare →

Framer82Product

AI-powered website design and publishing — generates responsive, professionally designed sites from descriptions.

Compare →

Midjourney79Product

AI image generation — artistic high-quality outputs, Discord bot, photorealistic V6 model.

Compare →

xCodeEval67Benchmark

Multilingual code evaluation across 17 languages.

Compare →

WMDP

Capabilities8 decomposed

multi-domain dangerous knowledge assessment across biosecurity, cybersecurity, and chemical security

unlearning method evaluation and comparison framework

expert-annotated hazard rubric scoring system

cross-domain dangerous knowledge correlation analysis

benchmark dataset versioning and curation pipeline

model-agnostic inference abstraction for diverse llm architectures

statistical significance testing and confidence interval estimation

red-teaming and adversarial prompt generation for benchmark validation

Related Artifactssharing capabilities

WildGuard

Humanity's Last Exam

SafetyBench Eval

SafetyBench

OpenAI: gpt-oss-safeguard-20b

ChemCrow

Best For

Known Limitations

Requirements

Input / Output

UnfragileRank

About

Categories

Alternatives to WMDP

Are you the builder of WMDP?

Get the weekly brief

Data Sources

WMDP

Capabilities8 decomposed

multi-domain dangerous knowledge assessment across biosecurity, cybersecurity, and chemical security

unlearning method evaluation and comparison framework

expert-annotated hazard rubric scoring system

cross-domain dangerous knowledge correlation analysis

benchmark dataset versioning and curation pipeline

model-agnostic inference abstraction for diverse llm architectures

statistical significance testing and confidence interval estimation

red-teaming and adversarial prompt generation for benchmark validation

Related Artifactssharing capabilities

WildGuard

Humanity's Last Exam

SafetyBench Eval

SafetyBench

OpenAI: gpt-oss-safeguard-20b

ChemCrow

Best For

Known Limitations

Requirements

Input / Output

UnfragileRank

About

Categories

Alternatives to WMDP

Are you the builder of WMDP?

Get the weekly brief

Data Sources