{"passport":{"unfragile":{"@version":"1.0","version":"2026-05","artifact":{"id":"safetybench-eval","slug":"safetybench-eval","name":"SafetyBench Eval","type":"benchmark","url":"https://github.com/thu-coai/SafetyBench","page_url":"https://unfragile.ai/safetybench-eval","categories":["testing-quality","observability"],"tags":[],"pricing":{"model":"free","free":true,"starting_price":null},"status":"active","verified":false},"capabilities":[{"id":"safetybench-eval__cap_0","uri":"capability://safety.moderation.multi.category.llm.safety.evaluation.via.multiple.choice.questions","name":"multi-category llm safety evaluation via multiple-choice questions","description":"Evaluates LLM safety across 7 distinct categories (offensiveness, unfairness, physical health, mental health, illegal activities, ethics, privacy) using 11,435 curated multiple-choice questions available in both Chinese and English. The benchmark constructs category-specific prompts, sends them to target models, extracts predicted answers from model responses, and compares against ground-truth labels (0->A, 1->B, 2->C, 3->D) to compute accuracy metrics per category and overall safety score.","intents":["Measure whether my LLM correctly refuses or handles safety-critical prompts across diverse harm categories","Compare safety performance of different models on a standardized, reproducible benchmark","Identify which safety categories my model is weakest in to prioritize alignment work","Validate that my fine-tuned or RLHF-trained model maintains safety guarantees across languages"],"best_for":["AI safety researchers evaluating proprietary and open-source LLMs","Teams building multilingual LLM products needing safety validation","Organizations submitting models to safety leaderboards and benchmarks","Academic groups studying cross-lingual safety alignment"],"limitations":["Multiple-choice format may not capture nuanced safety failures in open-ended generation","Fixed question set limits ability to detect novel or adversarial safety bypasses not in the benchmark","Evaluation requires API access or local model deployment; no built-in support for proprietary closed-source APIs beyond examples","Chinese subset (test_zh_subset.json) is filtered for sensitive keywords, potentially reducing coverage of edge cases","No dynamic or adaptive questioning — cannot follow up on ambiguous model responses"],"requires":["Python 3.6+","Internet connection to download from Hugging Face (~20MB dataset)","Access to an LLM (local, API-based, or cloud-hosted)","JSON parsing capability for question/answer extraction"],"input_types":["Multiple-choice questions (JSON: id, category, question text, 4 options, ground-truth answer)","Model API endpoints or local model instances","Optional: few-shot examples (5 per category from dev_en.json or dev_zh.json)"],"output_types":["Predicted answer labels (0->A, 1->B, 2->C, 3->D) per question","Accuracy metrics per safety category","Overall safety score","JSON submission format for leaderboard (question_id -> predicted_answer)"],"categories":["safety-moderation","data-processing-analysis"],"confidence":0.5,"matches":0,"success_rate":0},{"id":"safetybench-eval__cap_1","uri":"capability://planning.reasoning.zero.shot.and.few.shot.evaluation.mode.switching","name":"zero-shot and few-shot evaluation mode switching","description":"Supports two distinct evaluation paradigms: zero-shot (questions presented directly without examples) and five-shot (5 category-specific examples provided before each test question). The framework conditionally constructs prompts using dev_en.json/dev_zh.json few-shot examples or omits them entirely, allowing researchers to measure how in-context learning affects safety performance. Prompt templates are language-aware and can be customized per model to improve answer extraction accuracy.","intents":["Measure whether my model's safety performance degrades when given in-context examples of unsafe behavior","Determine if few-shot prompting helps or hurts safety alignment on my target LLM","Compare zero-shot vs few-shot safety scores to understand in-context learning effects","Adapt prompt templates for models with non-standard output formats"],"best_for":["Researchers studying in-context learning effects on safety","Teams optimizing prompt engineering for safety-critical applications","Builders comparing model robustness across different prompting strategies"],"limitations":["Few-shot examples are fixed (5 per category) — no dynamic example selection based on model behavior","Prompt template customization requires manual intervention per model; no automated prompt optimization","No support for chain-of-thought or reasoning-based prompting variants","Answer extraction from model responses is regex/string-matching based, not semantic parsing"],"requires":["dev_en.json or dev_zh.json files (5 examples per category)","test_en.json or test_zh.json files (full test set)","Model-specific prompt template (provided examples for Baichuan; others require customization)"],"input_types":["Few-shot example set (5 questions + answers per category)","Test question (single multiple-choice question)","Prompt template string with placeholders"],"output_types":["Constructed prompt string (zero-shot or few-shot variant)","Model response text","Extracted predicted answer (0->A, 1->B, 2->C, 3->D)"],"categories":["planning-reasoning","safety-moderation"],"confidence":0.5,"matches":0,"success_rate":0},{"id":"safetybench-eval__cap_2","uri":"capability://data.processing.analysis.bilingual.dataset.management.and.language.specific.evaluation","name":"bilingual dataset management and language-specific evaluation","description":"Manages parallel Chinese and English datasets (test_en.json, test_zh.json, dev_en.json, dev_zh.json) with a filtered Chinese subset (test_zh_subset.json, 300 questions per category) for sensitive keyword handling. Data acquisition uses Hugging Face hosting with dual download methods (shell script download_data.sh or Python download_data.py with datasets library). Each question maintains consistent structure (id, category, question, options, answer) across languages, enabling direct cross-lingual comparison of model safety performance.","intents":["Evaluate my model's safety in both Chinese and English to ensure consistent alignment across languages","Use the filtered Chinese subset to avoid triggering content policies during evaluation","Download and manage large multilingual datasets efficiently using Hugging Face infrastructure","Compare safety performance deltas between Chinese and English versions of the same model"],"best_for":["Teams building multilingual LLMs (e.g., serving Chinese and English markets)","Researchers studying cross-lingual safety alignment gaps","Organizations needing to evaluate models on sensitive content without triggering policies"],"limitations":["Filtered subset (test_zh_subset.json) removes sensitive keywords, potentially reducing coverage of edge cases that models must handle in production","No automatic translation between Chinese and English — parallel datasets are independently curated, not machine-translated","Language-specific prompt templates must be manually created; no built-in prompt translation","Dataset size is fixed; no dynamic or adaptive language-specific question generation"],"requires":["Python 3.6+ with datasets library (for Python download method)","Bash shell (for shell script download method)","Internet connection to Hugging Face (thu-coai/SafetyBench repository)","~20MB storage for all dataset files"],"input_types":["Hugging Face dataset repository reference (thu-coai/SafetyBench)","Language selection flag (en or zh)"],"output_types":["JSON files: test_en.json, test_zh.json, test_zh_subset.json, dev_en.json, dev_zh.json","Structured question objects with id, category, question, options, answer fields"],"categories":["data-processing-analysis","memory-knowledge"],"confidence":0.5,"matches":0,"success_rate":0},{"id":"safetybench-eval__cap_3","uri":"capability://data.processing.analysis.category.stratified.safety.metric.computation.and.leaderboard.submission","name":"category-stratified safety metric computation and leaderboard submission","description":"Computes accuracy metrics per safety category (offensiveness, unfairness, physical health, mental health, illegal activities, ethics, privacy) and aggregates to an overall safety score. Supports standardized leaderboard submission via JSON format (question_id -> predicted_answer). Metrics are computed by comparing predicted answers (extracted from model responses) against ground-truth labels, enabling fine-grained analysis of which safety dimensions a model excels or fails on. Results can be submitted to llmbench.ai/safety leaderboard for public comparison.","intents":["Identify which safety categories my model is weakest in to prioritize alignment efforts","Submit my model's results to the SafetyBench leaderboard for public benchmarking","Compare my model's per-category safety performance against other models","Track safety improvements across model versions using category-specific metrics"],"best_for":["Teams publishing model safety results and seeking leaderboard rankings","Researchers analyzing safety performance across multiple dimensions","Organizations tracking safety improvements over model iterations"],"limitations":["Metrics are accuracy-based only; no nuance for partial credit or near-misses","No confidence intervals or statistical significance testing built-in","Leaderboard submission requires manual JSON formatting; no automated submission API","No per-question difficulty weighting — all questions treated equally regardless of ambiguity"],"requires":["Predicted answers for all 11,435 questions (or subset being evaluated)","UTF-8 encoded JSON file with format: {question_id: predicted_answer}","Access to llmbench.ai/safety leaderboard for submission"],"input_types":["Predicted answer labels (0->A, 1->B, 2->C, 3->D) for each question","Ground-truth answer labels from dataset","Question category metadata"],"output_types":["Per-category accuracy (0.0-1.0)","Overall safety score (0.0-1.0)","JSON submission file for leaderboard","Leaderboard ranking and public results page"],"categories":["data-processing-analysis","safety-moderation"],"confidence":0.5,"matches":0,"success_rate":0},{"id":"safetybench-eval__cap_4","uri":"capability://automation.workflow.model.evaluation.pipeline.with.answer.extraction.and.validation","name":"model evaluation pipeline with answer extraction and validation","description":"Implements a standardized evaluation pipeline (exemplified in evaluate_baichuan.py) that constructs prompts, sends them to a target model via API or local inference, extracts predicted answers from model responses using model-specific parsing logic, and validates extracted answers against expected format (0->A, 1->B, 2->C, 3->D). The pipeline handles model-specific response formats and can be customized per model architecture. Supports batch evaluation of all 11,435 questions with error handling and logging.","intents":["Evaluate my model on SafetyBench without manually writing evaluation code","Adapt the evaluation pipeline to my model's specific API or inference interface","Batch-evaluate thousands of questions efficiently with error recovery","Extract and validate model answers reliably from diverse response formats"],"best_for":["Teams evaluating proprietary or custom LLMs on SafetyBench","Researchers needing reproducible, standardized evaluation code","Builders integrating SafetyBench into CI/CD pipelines for safety regression testing"],"limitations":["Answer extraction is model-specific and requires manual customization per architecture (e.g., Baichuan vs GPT vs Llama)","No built-in support for streaming responses or token-level probabilities","Error handling is basic; no automatic retry logic for API failures or timeouts","No support for batch API calls; evaluates questions sequentially, which is slow for large models","Assumes model can be called via Python; no support for web-only or proprietary closed-source APIs"],"requires":["Python 3.6+","Model access (local, API endpoint, or cloud service)","Model-specific evaluation script (e.g., evaluate_baichuan.py as template)","API credentials if using cloud-hosted models"],"input_types":["Constructed prompt string (zero-shot or few-shot)","Model API endpoint or local model instance","Model-specific configuration (temperature, max_tokens, etc.)"],"output_types":["Model response text","Extracted predicted answer (0->A, 1->B, 2->C, 3->D)","Validation status (valid/invalid)","JSON results file with all predictions"],"categories":["automation-workflow","tool-use-integration"],"confidence":0.5,"matches":0,"success_rate":0},{"id":"safetybench-eval__cap_5","uri":"capability://safety.moderation.seven.category.safety.taxonomy.and.question.curation","name":"seven-category safety taxonomy and question curation","description":"Defines a structured taxonomy of 7 safety categories (offensiveness, unfairness, physical health, mental health, illegal activities, ethics, privacy) and curates 11,435 diverse multiple-choice questions mapped to these categories. Each question is designed to test whether a model correctly handles or refuses harmful content within that category. The taxonomy is explicit and mutually exclusive, enabling fine-grained safety analysis. Questions are curated to be challenging and representative of real-world safety concerns.","intents":["Understand which safety dimensions my model needs to improve on","Ensure my safety training covers all major harm categories","Use a standardized safety taxonomy for communicating safety properties to stakeholders","Identify gaps in my model's safety coverage across diverse harm types"],"best_for":["Safety researchers studying LLM alignment across multiple harm dimensions","Teams building safety training datasets using SafetyBench as a reference taxonomy","Organizations communicating safety properties to regulators or customers"],"limitations":["Taxonomy is fixed and may not capture emerging or novel safety concerns","No hierarchical structure within categories (e.g., no sub-categories for types of offensiveness)","Question curation process is not fully transparent; no details on how questions were selected or validated","No weighting of categories by severity or real-world impact","Categories may not be equally represented in real-world harms (e.g., illegal activities may be rarer than offensiveness)"],"requires":["Understanding of the 7 safety categories","Access to the curated question dataset"],"input_types":["Safety category label (one of 7)","Question text and options"],"output_types":["Category-specific accuracy metrics","Per-category safety score","Taxonomy-aligned safety report"],"categories":["safety-moderation","data-processing-analysis"],"confidence":0.5,"matches":0,"success_rate":0},{"id":"safetybench-eval__cap_6","uri":"capability://data.processing.analysis.dataset.download.with.hugging.face.integration","name":"dataset download with hugging face integration","description":"Provides two download methods for SafetyBench datasets: shell script (download_data.sh) and Python script (download_data.py using Hugging Face datasets library). The architecture leverages Hugging Face Hub for dataset hosting and distribution, enabling one-command dataset acquisition with automatic decompression and directory structure creation. The Python method uses the datasets library for programmatic access, supporting integration into automated evaluation pipelines without manual file management.","intents":["download full SafetyBench dataset with single command","integrate dataset acquisition into automated evaluation pipelines","cache datasets locally for repeated evaluation runs","access datasets programmatically without manual file downloads"],"best_for":["developers building automated evaluation infrastructure","researchers needing reproducible dataset acquisition","teams with limited manual setup tolerance"],"limitations":["Requires internet connection for initial download; no offline dataset distribution","~20MB dataset size is small but may be slow on very limited bandwidth connections","Hugging Face dependency adds external service dependency; dataset availability depends on Hugging Face uptime","No checksum verification documented; unclear if downloads are validated for integrity"],"requires":["Python 3.6+","Internet connection","Hugging Face datasets library (for Python method)","bash shell (for shell script method)","~20MB disk space"],"input_types":["download method selection (shell or Python)","target directory path (optional)"],"output_types":["downloaded JSON files in data/ directory","directory structure: data/test_en.json, data/test_zh.json, data/dev_en.json, data/dev_zh.json, data/test_zh_subset.json"],"categories":["data-processing-analysis","automation-workflow"],"confidence":0.5,"matches":0,"success_rate":0},{"id":"safetybench-eval__cap_7","uri":"capability://data.processing.analysis.category.stratified.evaluation.metrics.computation","name":"category-stratified evaluation metrics computation","description":"Computes accuracy metrics stratified by safety category, enabling per-dimension performance analysis. The evaluation pipeline aggregates predictions across all questions in each category (offensiveness, unfairness, physical health, mental health, illegal activities, ethics, privacy) and computes category-specific accuracy scores. This architecture enables identification of category-specific vulnerabilities (e.g., a model may be robust on ethics but weak on physical health) without requiring separate evaluation runs.","intents":["identify which safety categories a model is weakest on","measure if safety improvements in one category regress performance in others","allocate safety engineering effort to weakest categories","compare category-specific safety profiles across model versions"],"best_for":["safety teams conducting detailed vulnerability analysis","model developers prioritizing safety improvements by category","researchers studying category-specific safety biases"],"limitations":["Category-level metrics mask within-category variance; some categories may have harder/easier questions","No statistical significance testing; unclear if category differences are meaningful or noise","Metrics are accuracy-only; no measure of degree of harm or severity of failures","No confidence intervals or uncertainty quantification per category"],"requires":["Completed predictions for all 11,435 questions","Category labels for each question (provided in dataset)","Python 3.6+ with basic data processing (dict aggregation)"],"input_types":["question predictions (question_id -> answer mapping)","ground truth answers","category labels"],"output_types":["per-category accuracy scores (0-100%)","category-level confusion matrices (optional)","overall accuracy across all categories"],"categories":["data-processing-analysis","safety-moderation"],"confidence":0.5,"matches":0,"success_rate":0},{"id":"safetybench-eval__headline","uri":"capability://safety.moderation.llm.safety.evaluation.benchmark","name":"llm safety evaluation benchmark","description":"SafetyBench is a comprehensive benchmark designed to evaluate the safety capabilities of Large Language Models (LLMs) through 11,435 diverse multiple-choice questions across various safety categories, making it essential for assessing model outputs in sensitive contexts.","intents":["best LLM safety evaluation benchmark","LLM safety evaluation for compliance","how to evaluate LLMs for safety","top benchmarks for LLM safety assessment","safety benchmarks for AI models"],"best_for":["researchers assessing LLM safety","developers ensuring ethical AI outputs"],"limitations":[],"requires":["access to an LLM for evaluation"],"input_types":["multiple-choice questions"],"output_types":["evaluation scores","safety assessments"],"categories":["safety-moderation"],"confidence":0.5,"matches":0,"success_rate":0}],"trust":{"score":62,"verified":false,"data_access_risk":"high","permissions":["Python 3.6+","Internet connection to download from Hugging Face (~20MB dataset)","Access to an LLM (local, API-based, or cloud-hosted)","JSON parsing capability for question/answer extraction","dev_en.json or dev_zh.json files (5 examples per category)","test_en.json or test_zh.json files (full test set)","Model-specific prompt template (provided examples for Baichuan; others require customization)","Python 3.6+ with datasets library (for Python download method)","Bash shell (for shell script download method)","Internet connection to Hugging Face (thu-coai/SafetyBench repository)"],"failure_modes":["Multiple-choice format may not capture nuanced safety failures in open-ended generation","Fixed question set limits ability to detect novel or adversarial safety bypasses not in the benchmark","Evaluation requires API access or local model deployment; no built-in support for proprietary closed-source APIs beyond examples","Chinese subset (test_zh_subset.json) is filtered for sensitive keywords, potentially reducing coverage of edge cases","No dynamic or adaptive questioning — cannot follow up on ambiguous model responses","Few-shot examples are fixed (5 per category) — no dynamic example selection based on model behavior","Prompt template customization requires manual intervention per model; no automated prompt optimization","No support for chain-of-thought or reasoning-based prompting variants","Answer extraction from model responses is regex/string-matching based, not semantic parsing","Filtered subset (test_zh_subset.json) removes sensitive keywords, potentially reducing coverage of edge cases that models must handle in production","builder identity is not verified yet","no observed match outcomes yet"],"rank_breakdown":{"adoption":0.7,"quality":0.8500000000000001,"ecosystem":0.49999999999999994,"match_graph":0.25,"freshness":0.52,"weights":{"adoption":0.25,"quality":0.35,"ecosystem":0.15,"match_graph":0.2,"freshness":0.05}},"observed_outcomes":{"matches":0,"success_rate":0,"avg_confidence":0,"top_intents":[],"last_matched_at":null},"maintenance":{"status":"active","updated_at":"2026-06-17T09:51:05.296Z","last_scraped_at":null,"last_commit":null},"community":{"stars":null,"forks":null,"weekly_downloads":null,"model_downloads":null,"model_likes":null}},"distribution":{"claim_url":"https://unfragile.ai/submit?claim=safetybench-eval","compare_url":"https://unfragile.ai/compare?artifact=safetybench-eval"}},"signature":"DZLak4dQ58/yFgsH3Gx3unLFakfkccnUmtmIxSMtuoiA/c6lZDFLVGiaNMgB7JEuER9r7l112o6xWWwTtuBLCw==","signedAt":"2026-06-20T01:59:39.395Z","signedBy":"unfragile.ai","version":1},"_links":{"self":"https://unfragile.ai/api/v1/passport/safetybench-eval","artifact":"https://unfragile.ai/safetybench-eval","verify":"https://unfragile.ai/api/v1/verify?slug=safetybench-eval","publicKey":"https://unfragile.ai/api/v1/trust-passport-public-key","spec":"https://unfragile.ai/trust","schema":"https://unfragile.ai/schema.json","docs":"https://unfragile.ai/docs"}}