Capability
2 artifacts provide this capability.
via “bilingual dataset management and language-specific evaluation”
11K safety evaluation questions across 7 categories.
Unique: Provides both the full Chinese dataset (test_zh.json) and a filtered subset (test_zh_subset.json, 300 questions per category) designed to avoid sensitive keywords, so teams can evaluate without submitting content that may trigger platform policies. Two download methods (shell script and Python) accommodate different user workflows.
vs others: Covers both Chinese and English, unlike English-only benchmarks; the filtered subset is a pragmatic addition for teams that need to evaluate without triggering policy violations.
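As a rough illustration of how a keyword-filtered, per-category-capped subset like the one described above could be built, here is a minimal Python sketch. The record fields (`id`, `category`, `question`), the helper name `build_filtered_subset`, and the keyword list are all hypothetical; the source does not document the actual file schema or the filtering rule used for test_zh_subset.json.

```python
from collections import defaultdict


def build_filtered_subset(questions, blocked_keywords, per_category=300):
    """Keep at most `per_category` questions per category, skipping any
    question whose text contains a blocked keyword.

    The field names "category" and "question" are assumptions, not the
    dataset's documented schema.
    """
    kept = defaultdict(list)
    for q in questions:
        # Drop the question if any blocked keyword appears in its text.
        if any(kw in q["question"] for kw in blocked_keywords):
            continue
        bucket = kept[q["category"]]
        # Enforce the per-category cap, preserving input order.
        if len(bucket) < per_category:
            bucket.append(q)
    return [q for bucket in kept.values() for q in bucket]


# Tiny in-memory sample standing in for a loaded JSON file (hypothetical data).
sample = [
    {"id": 0, "category": "A", "question": "harmless question one"},
    {"id": 1, "category": "A", "question": "contains FLAGGED term"},
    {"id": 2, "category": "A", "question": "harmless question two"},
    {"id": 3, "category": "B", "question": "harmless question three"},
]
subset = build_filtered_subset(sample, blocked_keywords=["FLAGGED"], per_category=1)
```

With the real files, `sample` would presumably be replaced by the parsed contents of test_zh.json and `per_category` left at 300, but that depends on the actual schema.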
via “chinese-english parallel dataset with sensitive keyword filtering”
11K safety evaluation questions across 7 categories.
Unique: Provides true parallel Chinese-English safety evaluation with identical category structure and question mapping, plus a filtered Chinese subset for regulated environments. Most safety benchmarks (TruthfulQA, HarmBench) are English-only; MMLU-Pro has Chinese but lacks safety focus and category stratification.
vs others: Enables direct cross-lingual safety comparison on identical questions, unlike separate English and Chinese benchmarks; the filtered subset offers an evaluation option for regulated environments that other multilingual safety benchmarks do not provide.
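Because the Chinese and English questions are described as sharing one category structure and question mapping, a per-category accuracy gap between the two languages can be computed directly. The sketch below assumes each question has a shared `id` and a `category` field, and that model predictions and the answer key are dictionaries keyed by that id; these names are illustrative and not taken from the dataset.

```python
from collections import defaultdict


def per_category_accuracy(questions, predictions, answers):
    """Return {category: accuracy} for one language's predictions.

    `predictions` and `answers` map question id -> chosen option index
    (an assumed representation).
    """
    totals = defaultdict(int)
    correct = defaultdict(int)
    for q in questions:
        totals[q["category"]] += 1
        # A missing prediction counts as incorrect.
        if predictions.get(q["id"]) == answers[q["id"]]:
            correct[q["category"]] += 1
    return {cat: correct[cat] / totals[cat] for cat in totals}


def cross_lingual_gap(questions, preds_en, preds_zh, answers):
    """English accuracy minus Chinese accuracy, per category."""
    acc_en = per_category_accuracy(questions, preds_en, answers)
    acc_zh = per_category_accuracy(questions, preds_zh, answers)
    return {cat: acc_en[cat] - acc_zh[cat] for cat in acc_en}


# Hypothetical toy data: three parallel questions in two categories.
questions = [
    {"id": 0, "category": "A"},
    {"id": 1, "category": "A"},
    {"id": 2, "category": "B"},
]
answers = {0: 0, 1: 1, 2: 2}
preds_en = {0: 0, 1: 1, 2: 0}
preds_zh = {0: 0, 1: 0, 2: 2}
gap = cross_lingual_gap(questions, preds_en, preds_zh, answers)
```

A positive value means the model scored higher on the English version of that category; a negative value means it scored higher on the Chinese version.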
© 2026 Unfragile. The platform for software for agents.