multi-class prompt harmfulness classification
Classifies incoming prompts across multiple harm categories (e.g., violence, illegal activity, sexual content, hate speech, self-harm) using a fine-tuned language model trained on diverse adversarial examples. The model learns to recognize harmful intent patterns across different risk domains simultaneously, enabling nuanced detection beyond binary safe/unsafe classification. Outputs confidence scores per harm category to support downstream risk-based routing decisions; a usage sketch follows the notes below.
Unique: Trained on WildGuard's curated dataset of 10K+ human-annotated adversarial prompts spanning 13 harm categories, using a multi-task learning approach that jointly optimizes for prompt harmfulness, response harmfulness, and refusal detection, so a single model handles all three safety dimensions instead of three separate classifiers
vs alternatives: More comprehensive than OpenAI's moderation API (covers more harm categories) and more specialized than generic text classifiers because it's specifically fine-tuned on jailbreak and adversarial prompt patterns rather than general toxicity
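A minimal usage sketch, assuming the classifier is exposed as a multi-label sequence-classification checkpoint through Hugging Face transformers; the model name, example prompt, and 0.5 threshold are illustrative placeholders, not WildGuard's published interface:

```python
# Per-category prompt scoring sketch; checkpoint name is hypothetical.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL = "org/prompt-harm-classifier"  # placeholder multi-label checkpoint
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL)
model.eval()

def score_prompt(prompt: str) -> dict[str, float]:
    inputs = tokenizer(prompt, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits.squeeze(0)
    # Sigmoid, not softmax: a prompt can trigger several categories at once.
    probs = torch.sigmoid(logits)
    return {model.config.id2label[i]: p.item() for i, p in enumerate(probs)}

scores = score_prompt("Explain how to bypass a home alarm system.")
flagged = {cat: p for cat, p in scores.items() if p >= 0.5}  # tunable threshold
```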
response harmfulness detection and classification
Analyzes LLM-generated responses to identify harmful content that slipped past prompt filtering, classifying violations across the same harm taxonomy as prompt detection. Uses a separate classification head trained on model outputs paired with human safety judgments, enabling detection of harmful content generation even when the initial prompt appeared benign. Supports both full-response analysis and streaming token-level detection for real-time filtering; the streaming mode is sketched after the notes below.
Unique: Specifically trained on LLM-generated text rather than generic harmful content, using a dataset of model outputs paired with human safety judgments; this captures model-specific failure modes (e.g., verbose harmful explanations) that generic classifiers miss
vs alternatives: More effective than post-hoc content filters (like regex or keyword matching) because it understands semantic intent and can detect harmful content expressed in novel ways; more targeted than general toxicity classifiers because it's calibrated for LLM output patterns
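A sketch of the streaming mode, assuming a hypothetical `score_response` helper analogous to the `score_prompt` sketch above but backed by the response-harmfulness head; the rescoring interval and threshold are illustrative knobs:

```python
# Streaming moderation sketch: rescore the accumulated text every `every_n`
# tokens and halt the stream once any harm category crosses `threshold`.
# Tokens emitted before a check cannot be retracted, so `every_n` trades
# scoring cost against how much harmful text can leak before the cut.
def moderated_stream(token_iter, score_response, every_n=16, threshold=0.5):
    emitted = []
    for i, token in enumerate(token_iter, start=1):
        emitted.append(token)
        if i % every_n == 0:
            scores = score_response("".join(emitted))
            if max(scores.values()) >= threshold:
                yield "\n[stream halted by safety filter]"
                return
        yield token
```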
refusal detection and classification
Identifies when an LLM refuses to answer a prompt and classifies the refusal reason (safety concern, capability limitation, policy violation, etc.) using a specialized classifier trained on refusal patterns. This enables distinguishing between legitimate refusals (model correctly declining harmful requests) and false refusals (model unnecessarily blocking benign requests), supporting both safety auditing and user experience optimization. Outputs refusal confidence and category to enable downstream handling (e.g., rephrasing suggestions, escalation); a routing sketch follows the notes below.
Unique: Treats refusal detection as a distinct classification task rather than a binary safe/unsafe decision, enabling fine-grained analysis of model behavior; it captures the nuance that some refusals are appropriate (blocking harmful requests) while others are false positives (blocking benign requests)
vs alternatives: More sophisticated than simple keyword matching for refusal detection because it understands semantic refusal patterns; by categorizing refusal reasons, it enables safety auditing that generic classifiers cannot support
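A sketch of the downstream handling these outputs enable; the category names mirror the refusal reasons above, while the thresholds and route names are hypothetical:

```python
# Route on refusal category and confidence; values are illustrative.
from dataclasses import dataclass
from typing import Optional

@dataclass
class RefusalResult:
    is_refusal: bool
    category: Optional[str]  # e.g., "safety", "capability", "policy"
    confidence: float

def route(result: RefusalResult) -> str:
    if not result.is_refusal:
        return "deliver"           # pass the response through unchanged
    if result.category == "safety" and result.confidence >= 0.8:
        return "log_for_audit"     # likely a legitimate refusal
    if result.category == "capability":
        return "suggest_rephrase"  # limitation, not a safety concern
    return "escalate_for_review"   # low confidence or possible false refusal
```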
curated adversarial prompt dataset with human annotations
Provides a structured dataset of 10K+ adversarial prompts spanning 13 harm categories, each annotated by human raters for prompt harmfulness, response harmfulness, and refusal appropriateness. The dataset includes diverse attack patterns (jailbreaks, prompt injections, social engineering) and edge cases, enabling researchers and builders to train, evaluate, and benchmark safety classifiers. Supports both supervised fine-tuning of safety models and evaluation of existing LLM safety mechanisms; a loading sketch follows the notes below.
Unique: Combines three annotation dimensions (prompt harmfulness, response harmfulness, refusal appropriateness) in a single dataset, enabling multi-task learning and comprehensive safety evaluation; most public datasets cover only one of these dimensions
vs alternatives: More comprehensive than generic toxicity datasets (e.g., Jigsaw) because it's specifically curated for adversarial prompts and LLM jailbreaks; more detailed than simple safe/unsafe labels because it provides fine-grained harm categories and multi-dimensional annotations
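A sketch of loading such a dataset with the Hugging Face `datasets` library; the dataset identifier, config name, and column names are assumptions to be verified against the actual dataset card:

```python
from datasets import load_dataset

# Identifier, config, and column names below are assumptions for illustration.
ds = load_dataset("allenai/wildguardmix", "wildguardtrain", split="train")
for row in ds.select(range(3)):
    print(
        row["prompt"][:60],
        row["prompt_harm_label"],       # prompt harmfulness dimension
        row["response_harm_label"],     # response harmfulness dimension
        row["response_refusal_label"],  # refusal appropriateness dimension
    )
```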
pre-trained safety classifier model with multi-task learning
Provides a fine-tuned language model (based on Llama 2 or similar backbone) trained via multi-task learning to simultaneously predict prompt harmfulness, response harmfulness, and refusal appropriateness. The model uses shared representations for all three tasks, enabling efficient inference and transfer learning across safety dimensions. Available in multiple sizes (7B, 13B parameters) to support different latency/accuracy trade-offs in production deployments; the shared-backbone layout is sketched after the notes below.
Unique: Uses multi-task learning with shared representations across three safety dimensions (prompt harm, response harm, refusal appropriateness) rather than separate single-task models, reducing total model size and inference latency while improving generalization through the implicit regularization of auxiliary tasks
vs alternatives: More efficient than running three separate safety classifiers because it shares parameters and inference compute; more accurate than single-task models on individual tasks due to regularization from auxiliary tasks; more flexible than API-based safety services because it runs locally without network latency or data transmission concerns
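A minimal PyTorch sketch of the shared-backbone, three-head layout described above; the hidden size, pooling choice, and head names are illustrative rather than the released architecture:

```python
import torch.nn as nn

class MultiTaskSafetyModel(nn.Module):
    """Shared encoder with one lightweight head per safety dimension."""

    def __init__(self, encoder: nn.Module, hidden: int = 4096, n_harm: int = 13):
        super().__init__()
        self.encoder = encoder                             # shared backbone
        self.prompt_harm_head = nn.Linear(hidden, n_harm)  # multi-label logits
        self.response_harm_head = nn.Linear(hidden, n_harm)
        self.refusal_head = nn.Linear(hidden, 2)           # refusal / compliance

    def forward(self, input_ids, attention_mask):
        # Assumes a Hugging Face-style backbone returning last_hidden_state.
        h = self.encoder(input_ids, attention_mask=attention_mask).last_hidden_state
        pooled = h[:, -1]  # last-token pooling; one of several valid choices
        return {
            "prompt_harm": self.prompt_harm_head(pooled),
            "response_harm": self.response_harm_head(pooled),
            "refusal": self.refusal_head(pooled),
        }
```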
harm category taxonomy and annotation schema
Defines a structured taxonomy of 13 harm categories (violence, illegal activity, sexual content, hate speech, self-harm, etc.) with clear definitions and annotation guidelines for consistent human labeling. The schema supports multi-label annotation (a single prompt can belong to multiple categories) and confidence scoring, enabling nuanced safety classification beyond binary safe/unsafe decisions. Includes inter-rater agreement metrics and quality control procedures for maintaining annotation consistency; an annotation-record sketch follows the notes below.
Unique: Provides a comprehensive 13-category taxonomy specifically designed for LLM safety rather than generic content moderation, with multi-label support enabling fine-grained classification of prompts that span multiple harm dimensions
vs alternatives: More detailed than OpenAI's moderation API (which covers ~6 top-level categories) and more LLM-specific than general content moderation taxonomies; enables richer safety analysis and more targeted mitigation strategies
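A sketch of what an annotation record under this schema could look like; only the categories named above are listed, and the rest of the 13-category taxonomy is elided rather than guessed:

```python
from dataclasses import dataclass, field

# Categories named in the taxonomy above; remaining entries elided.
HARM_CATEGORIES = {
    "violence", "illegal_activity", "sexual_content",
    "hate_speech", "self_harm",
    # ...
}

@dataclass
class Annotation:
    prompt: str
    rater_id: str
    labels: dict[str, float] = field(default_factory=dict)  # category -> confidence

    def validate(self) -> None:
        # Multi-label: any subset of categories may apply to one prompt.
        unknown = set(self.labels) - HARM_CATEGORIES
        if unknown:
            raise ValueError(f"unknown harm categories: {unknown}")
```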
evaluation benchmark for safety classifier performance
Provides standardized evaluation metrics and benchmark results for safety classifiers, including precision, recall, F1-score, and ROC-AUC across all 13 harm categories. Enables comparison of different safety approaches (API-based, fine-tuned models, rule-based systems) on a common test set with consistent evaluation methodology. Includes ablation studies showing the contribution of different training techniques (multi-task learning, data augmentation, etc.) to overall performance; a per-category metrics sketch follows the notes below.
Unique: Provides multi-dimensional evaluation across 13 harm categories with per-category metrics rather than a single aggregate score, enabling fine-grained analysis of safety classifier performance and identification of specific weaknesses
vs alternatives: More comprehensive than simple accuracy metrics because it includes precision, recall, and ROC-AUC; more actionable than generic benchmarks because it's specific to safety classification and includes category-level breakdowns
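A sketch of computing the per-category breakdown with scikit-learn, assuming multi-label indicator arrays of shape (n_examples, 13); the 0.5 threshold is a tunable choice:

```python
import numpy as np
from sklearn.metrics import precision_recall_fscore_support, roc_auc_score

def per_category_report(y_true, y_score, categories, threshold=0.5):
    y_true = np.asarray(y_true)    # binary indicators, shape (n, k)
    y_score = np.asarray(y_score)  # predicted probabilities, shape (n, k)
    y_pred = (y_score >= threshold).astype(int)
    p, r, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, average=None, zero_division=0
    )
    report = {}
    for i, cat in enumerate(categories):
        report[cat] = {
            "precision": float(p[i]),
            "recall": float(r[i]),
            "f1": float(f1[i]),
            # ROC-AUC requires both classes present in this category's slice.
            "roc_auc": float(roc_auc_score(y_true[:, i], y_score[:, i])),
        }
    return report
```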
fine-tuning framework for domain-specific safety models
Provides training scripts, loss functions, and hyperparameter configurations for fine-tuning the WildGuard base model on domain-specific safety concerns with minimal labeled data. Implements techniques like low-rank adaptation (LoRA), data augmentation, and curriculum learning to improve sample efficiency and reduce overfitting. Includes evaluation utilities for monitoring validation performance and early stopping to prevent degradation on the original safety tasks; a LoRA setup sketch follows the notes below.
Unique: Provides end-to-end fine-tuning infrastructure with parameter-efficient techniques (LoRA) and multi-task regularization to prevent catastrophic forgetting, enabling safe domain adaptation without requiring full model retraining or massive labeled datasets
vs alternatives: More efficient than fine-tuning from scratch because it leverages pre-trained representations; more practical than API-based safety services because it enables customization without vendor lock-in; more accessible than building custom classifiers because it provides templates and best practices
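A sketch of the parameter-efficient path using the `peft` library; the base checkpoint name is a placeholder, and the rank, alpha, dropout, and target modules are common starting points rather than values prescribed by the framework:

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForSequenceClassification

# Checkpoint name is a placeholder for the actual base model.
base = AutoModelForSequenceClassification.from_pretrained(
    "org/wildguard-base", num_labels=13
)
config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # adapt attention projections only
    task_type="SEQ_CLS",
)
model = get_peft_model(base, config)
model.print_trainable_parameters()  # typically well under 1% of all weights
```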