{"passport":{"unfragile":{"@version":"1.0","version":"2026-05","artifact":{"id":"safetybench","slug":"safetybench","name":"SafetyBench","type":"benchmark","url":"https://github.com/thu-coai/SafetyBench","page_url":"https://unfragile.ai/safetybench","categories":["testing-quality"],"tags":[],"pricing":{"model":"free","free":true,"starting_price":null},"status":"active","verified":false},"capabilities":[{"id":"safetybench__cap_0","uri":"capability://safety.moderation.multilingual.safety.evaluation.dataset.with.category.stratified.sampling","name":"multilingual safety evaluation dataset with category-stratified sampling","description":"Provides 11,435 multiple-choice questions across 7 safety categories in parallel Chinese and English versions, with structured JSON schema (id, category, question, options array, answer index) enabling systematic evaluation of LLM safety alignment. Dataset includes full test sets (test_en.json, test_zh.json) and category-balanced few-shot examples (dev_en.json, dev_zh.json with 5 examples per category) for both zero-shot and few-shot evaluation protocols.","intents":["Evaluate whether my LLM correctly refuses harmful requests across diverse safety domains","Compare safety performance of multiple models on identical questions in both languages","Understand which safety categories my model struggles with through fine-grained category-level metrics","Build a safety evaluation pipeline that tests models in both Chinese and English simultaneously"],"best_for":["LLM safety researchers benchmarking alignment across model families","Teams evaluating Chinese-language LLMs where safety is critical (finance, healthcare, government)","Organizations building multilingual AI systems requiring parity safety validation","Academic researchers studying cross-lingual safety generalization"],"limitations":["Multiple-choice format may not capture nuanced safety failures in open-ended generation","11,435 questions is smaller than some general-purpose 
benchmarks (MMLU has 15,000+), potentially missing long-tail safety edge cases","Dataset is static — does not adapt to emerging safety concerns or adversarial techniques discovered post-publication","Chinese subset (test_zh_subset.json) is filtered for sensitive keywords, potentially biasing evaluation toward detectable rather than subtle safety violations"],"requires":["Python 3.6+","Internet connection for Hugging Face dataset download (~20MB storage)","Hugging Face datasets library (for Python download method) or curl/wget (for shell script method)","Access to an LLM API or local model for evaluation (optional if only exploring dataset)"],"input_types":["JSON schema with question, options, and metadata","Model API endpoints or local model inference interfaces","Prompt templates (zero-shot or few-shot)"],"output_types":["JSON predictions mapping question IDs to answer indices (0-3 for A-D)","Category-level accuracy metrics and safety scores","Leaderboard-compatible submission format"],"categories":["safety-moderation","data-processing-analysis"],"confidence":0.5,"matches":0,"success_rate":0},{"id":"safetybench__cap_1","uri":"capability://safety.moderation.zero.shot.and.few.shot.evaluation.protocol.with.prompt.templating","name":"zero-shot and few-shot evaluation protocol with prompt templating","description":"Implements dual evaluation modes where zero-shot presents questions directly without context, while five-shot provides 5 category-matched examples before each test question. 
System uses configurable prompt templates that can be adapted per-model (as shown in evaluate_baichuan.py) to optimize answer extraction from model outputs, supporting both structured and free-form response parsing.","intents":["Test whether my model's safety alignment is robust without in-context examples (zero-shot)","Measure how much in-context safety examples improve model performance (few-shot delta)","Adapt evaluation prompts for models with different output formatting preferences","Compare model safety across evaluation settings to detect prompt-sensitivity vulnerabilities"],"best_for":["Researchers studying in-context learning effects on safety alignment","Teams evaluating models with non-standard output formats requiring custom prompt engineering","Organizations benchmarking models across different prompt sensitivity profiles","Safety auditors detecting prompt-injection vulnerabilities through evaluation-setting variance"],"limitations":["Five-shot examples are fixed per category — does not support dynamic example selection based on model performance or adversarial difficulty","Prompt templates require manual tuning per model family; no automated prompt optimization framework provided","Few-shot evaluation assumes model can reliably extract answers from 5 examples; models with poor in-context learning may show artificially low few-shot scores","No support for chain-of-thought or reasoning-based evaluation — only direct answer extraction"],"requires":["Python 3.6+","Model API with text generation capability or local inference interface","Prompt template strings (provided defaults or custom)","Answer extraction logic (regex or model-specific parsing)"],"input_types":["Question text + options array","Few-shot examples (5 per category for five-shot mode)","Prompt template strings with variable placeholders"],"output_types":["Model-generated text response","Extracted answer index (0-3)","Accuracy metrics per evaluation 
setting"],"categories":["safety-moderation","planning-reasoning"],"confidence":0.5,"matches":0,"success_rate":0},{"id":"safetybench__cap_2","uri":"capability://safety.moderation.category.stratified.safety.metric.aggregation.and.leaderboard.submission","name":"category-stratified safety metric aggregation and leaderboard submission","description":"Aggregates model predictions into per-category accuracy scores across 7 safety domains, enabling fine-grained safety failure analysis beyond aggregate metrics. Leaderboard submission accepts UTF-8 JSON files mapping question IDs to predicted answer indices, with backend validation and ranking against baseline models. Architecture supports both English and Chinese evaluation tracks with separate leaderboards.","intents":["Identify which safety categories my model is weakest in (e.g., illegal activity vs. bias)","Compare my model's category-level safety profile against published baselines on the leaderboard","Submit evaluation results to the official SafetyBench leaderboard for peer comparison","Diagnose whether safety failures are systematic (one category) or distributed (all categories)"],"best_for":["Safety researchers publishing model evaluations and seeking peer comparison","Teams building safety-critical LLMs needing category-level diagnostics","Organizations tracking safety improvements across model versions","Leaderboard participants competing on multilingual safety benchmarks"],"limitations":["Leaderboard submission requires manual JSON file preparation — no automated submission API provided","Category-level metrics are computed post-hoc from predictions; no real-time feedback during evaluation","Leaderboard does not publish per-question failure analysis — only aggregate category scores","No support for weighted category scoring; all categories treated equally despite potential real-world importance differences"],"requires":["Python 3.6+ for metric computation","UTF-8 encoded JSON file with format: {question_id: 
answer_index}","Access to llmbench.ai/safety leaderboard submission portal","Model predictions for all 11,435 questions (or subset for partial evaluation)"],"input_types":["Model predictions as JSON mapping (question_id -> answer_index)","Ground truth labels from dataset (answer field)"],"output_types":["Per-category accuracy scores (7 categories)","Aggregate safety score","Leaderboard ranking and comparison against baselines","Category-level performance breakdown"],"categories":["safety-moderation","data-processing-analysis"],"confidence":0.5,"matches":0,"success_rate":0},{"id":"safetybench__cap_3","uri":"capability://data.processing.analysis.hugging.face.dataset.integration.with.dual.download.methods","name":"hugging face dataset integration with dual download methods","description":"Provides two data acquisition paths: shell script (download_data.sh) using curl/wget for direct Hugging Face download, and Python method (download_data.py) using the Hugging Face datasets library for programmatic access. 
Both methods download 5 JSON files (test_en.json, test_zh.json, test_zh_subset.json, dev_en.json, dev_zh.json) into a local data directory, with automatic decompression and validation.","intents":["Download SafetyBench dataset without installing Python dependencies (shell script method)","Integrate SafetyBench into Python evaluation pipelines using the datasets library","Automate dataset updates when new versions are published on Hugging Face","Cache dataset locally for offline evaluation without repeated downloads"],"best_for":["DevOps engineers setting up evaluation infrastructure via shell scripts","Python developers building integrated evaluation pipelines","Teams with restricted internet access needing one-time bulk download","Researchers automating benchmark evaluation across multiple models"],"limitations":["Shell script method requires curl/wget and bash — not portable to Windows without WSL or Git Bash","Python method adds dependency on Hugging Face datasets library (requires additional pip install)","No built-in checksum validation — cannot verify dataset integrity post-download","Dataset size (~20MB) is small but requires internet connectivity; no offline distribution method provided"],"requires":["For shell script: bash, curl or wget, ~20MB disk space","For Python method: Python 3.6+, huggingface-hub or datasets library","Internet connection for initial download","Write permissions to create data/ directory"],"input_types":["Hugging Face dataset repository URL (thu-coai/SafetyBench)","Download method selection (shell or Python)"],"output_types":["5 JSON files in data/ directory: test_en.json, test_zh.json, test_zh_subset.json, dev_en.json, dev_zh.json","Structured dataset objects (if using Python datasets 
library)"],"categories":["data-processing-analysis","tool-use-integration"],"confidence":0.5,"matches":0,"success_rate":0},{"id":"safetybench__cap_4","uri":"capability://safety.moderation.chinese.english.parallel.dataset.with.sensitive.keyword.filtering","name":"chinese-english parallel dataset with sensitive keyword filtering","description":"Maintains three parallel test datasets: full English (test_en.json), full Chinese (test_zh.json), and filtered Chinese subset (test_zh_subset.json with 300 questions per category, filtered for sensitive keywords). Each question maintains identical structure and category mapping across languages, enabling direct cross-lingual comparison while test_zh_subset provides a safer evaluation option for sensitive deployment contexts.","intents":["Evaluate my model's safety alignment in both Chinese and English on identical questions","Test whether safety alignment transfers across languages or exhibits language-specific vulnerabilities","Use the filtered Chinese subset for safety evaluation in regulated environments (finance, government)","Measure cross-lingual safety parity to detect language-specific alignment gaps"],"best_for":["Teams building multilingual LLMs requiring cross-lingual safety validation","Chinese-language model developers needing safety benchmarks in their primary language","Organizations in regulated industries using the filtered subset to avoid sensitive content","Researchers studying cross-lingual safety generalization and transfer"],"limitations":["Filtered Chinese subset (300 questions per category, 2,100 in total) is roughly 5.4x smaller than the full test sets (11,435 questions, about 1,634 per category on average), reducing statistical power for category-level analysis","Sensitive keyword filtering is heuristic-based — may remove legitimate safety questions or miss subtle harmful content","No explicit alignment between English and Chinese questions — parallel structure assumed but not validated","Filtering criteria not documented — unclear which keywords 
trigger removal, limiting reproducibility"],"requires":["Python 3.6+ for processing JSON files","Chinese language support in evaluation environment (UTF-8 encoding)","Model capable of processing both English and Chinese text"],"input_types":["Question text in English or Chinese","Category labels (identical across languages)","Options arrays (translated to match language)"],"output_types":["Per-language accuracy scores","Cross-lingual safety parity metrics","Language-specific failure analysis"],"categories":["safety-moderation","data-processing-analysis"],"confidence":0.5,"matches":0,"success_rate":0},{"id":"safetybench__cap_5","uri":"capability://safety.moderation.7.category.safety.taxonomy.with.fine.grained.failure.mode.classification","name":"7-category safety taxonomy with fine-grained failure mode classification","description":"Organizes 11,435 questions into 7 distinct safety categories (specific categories not detailed in provided docs but implied by category field in JSON schema), enabling systematic analysis of which safety domains models fail in. Each question is tagged with a category label, allowing per-category accuracy computation and identification of domain-specific alignment gaps. Category-balanced few-shot examples (5 per category) support category-specific evaluation.","intents":["Understand which safety domains my model is weakest in (e.g., illegal activity vs. bias vs. 
misinformation)","Prioritize safety improvements by identifying the highest-impact failure categories","Analyze whether safety failures are systematic (one category) or distributed across domains","Build category-specific safety interventions targeting model weaknesses"],"best_for":["Safety researchers diagnosing model-specific safety vulnerabilities","Teams building safety-critical systems needing targeted alignment improvements","Organizations prioritizing safety work based on real-world impact of failure categories","Researchers studying whether safety alignment generalizes across domains"],"limitations":["7 categories may be too coarse-grained for some applications — subcategories not provided","Category definitions not explicitly documented in provided materials — unclear what each category covers","No weighting by real-world harm severity — all categories treated equally despite potential importance differences","Category-level sample sizes (~1,635 per category) may be insufficient for statistical significance in low-accuracy regimes"],"requires":["Dataset with category field populated for all questions","Evaluation script supporting category-level metric aggregation","Python 3.6+ for computing per-category statistics"],"input_types":["Question with category label","Model prediction for that question"],"output_types":["Per-category accuracy scores (7 values)","Category-level failure analysis","Category-specific improvement recommendations"],"categories":["safety-moderation","data-processing-analysis"],"confidence":0.5,"matches":0,"success_rate":0},{"id":"safetybench__headline","uri":"capability://safety.moderation.benchmark.for.evaluating.safety.in.large.language.models","name":"benchmark for evaluating safety in large language models","description":"SafetyBench is a comprehensive benchmark designed to evaluate the safety capabilities of Large Language Models (LLMs) through a diverse set of 11,435 multiple-choice questions across 7 safety categories in both 
Chinese and English.","intents":["best safety benchmark for LLMs","LLM safety evaluation tools","how to assess safety in language models","top benchmarks for model safety","safety evaluation frameworks for AI"],"best_for":["researchers assessing LLM safety","developers testing AI model outputs"],"limitations":["limited to safety evaluation","requires access to an LLM for testing"],"requires":["Python 3.6 or higher","internet connection"],"input_types":["multiple-choice questions"],"output_types":["safety evaluation results"],"categories":["safety-moderation"],"confidence":0.5,"matches":0,"success_rate":0}],"trust":{"score":61,"verified":false,"data_access_risk":"high","permissions":["Python 3.6+","Internet connection for Hugging Face dataset download (~20MB storage)","Hugging Face datasets library (for Python download method) or curl/wget (for shell script method)","Access to an LLM API or local model for evaluation (optional if only exploring dataset)","Model API with text generation capability or local inference interface","Prompt template strings (provided defaults or custom)","Answer extraction logic (regex or model-specific parsing)","Python 3.6+ for metric computation","UTF-8 encoded JSON file with format: {question_id: answer_index}","Access to llmbench.ai/safety leaderboard submission portal"],"failure_modes":["Multiple-choice format may not capture nuanced safety failures in open-ended generation","11,435 questions is smaller than some general-purpose benchmarks (MMLU has 15,000+), potentially missing long-tail safety edge cases","Dataset is static — does not adapt to emerging safety concerns or adversarial techniques discovered post-publication","Chinese subset (test_zh_subset.json) is filtered for sensitive keywords, potentially biasing evaluation toward detectable rather than subtle safety violations","Five-shot examples are fixed per category — does not support dynamic example selection based on model performance or adversarial difficulty","Prompt 
templates require manual tuning per model family; no automated prompt optimization framework provided","Few-shot evaluation assumes model can reliably extract answers from 5 examples; models with poor in-context learning may show artificially low few-shot scores","No support for chain-of-thought or reasoning-based evaluation — only direct answer extraction","Leaderboard submission requires manual JSON file preparation — no automated submission API provided","Category-level metrics are computed post-hoc from predictions; no real-time feedback during evaluation","builder identity is not verified yet","no observed match outcomes yet"],"rank_breakdown":{"adoption":0.7,"quality":0.8500000000000001,"ecosystem":0.39999999999999997,"match_graph":0.25,"freshness":0.52,"weights":{"adoption":0.25,"quality":0.35,"ecosystem":0.15,"match_graph":0.2,"freshness":0.05}},"observed_outcomes":{"matches":0,"success_rate":0,"avg_confidence":0,"top_intents":[],"last_matched_at":null},"maintenance":{"status":"active","updated_at":"2026-06-17T09:51:05.296Z","last_scraped_at":null,"last_commit":null},"community":{"stars":null,"forks":null,"weekly_downloads":null,"model_downloads":null,"model_likes":null}},"distribution":{"claim_url":"https://unfragile.ai/submit?claim=safetybench","compare_url":"https://unfragile.ai/compare?artifact=safetybench"}},"signature":"M8DwpMDe4V0eTzfmsBeqoX7sD1PDHEgHAi6LuqSo+DZf+/NSkQcAtQ+RwSsL3kKbE46JMNbWBT7i33gjdrjGCw==","signedAt":"2026-06-20T22:35:30.267Z","signedBy":"unfragile.ai","version":1},"_links":{"self":"https://unfragile.ai/api/v1/passport/safetybench","artifact":"https://unfragile.ai/safetybench","verify":"https://unfragile.ai/api/v1/verify?slug=safetybench","publicKey":"https://unfragile.ai/api/v1/trust-passport-public-key","spec":"https://unfragile.ai/trust","schema":"https://unfragile.ai/schema.json","docs":"https://unfragile.ai/docs"}}