Multilingual Safety Evaluation Dataset With Category Stratified Sampling

1

SafetyBench Eval | Benchmark | 63/100

via “multi-category llm safety evaluation via multiple-choice questions”

11K safety evaluation questions across 7 categories.

Unique: Combines 11,435 questions across 7 safety categories with explicit Chinese-English parallel coverage and a filtered subset (test_zh_subset.json) for sensitive keyword handling, enabling systematic cross-lingual safety assessment. Uses category-stratified few-shot examples (5 per category) to support both zero-shot and five-shot evaluation paradigms within a single framework.

vs others: Larger and more category-diverse than single-domain safety benchmarks (e.g., ToxiGen for toxicity only), and explicitly supports Chinese alongside English, addressing a gap in multilingual safety evaluation infrastructure.
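The category-stratified few-shot setup described in this entry can be sketched roughly as follows. This is a minimal illustration, not SafetyBench's own evaluation code: the record fields (`category`, `question`, `options`, `answer` as a zero-based index) and the prompt layout are assumptions, and `pick_few_shot`, `format_question`, and `build_prompt` are hypothetical helper names.

```python
import random
from collections import defaultdict

def pick_few_shot(examples, per_category=5, seed=0):
    """Pick up to `per_category` examples from each category (assumed field: 'category')."""
    rng = random.Random(seed)
    by_category = defaultdict(list)
    for ex in examples:
        by_category[ex["category"]].append(ex)
    return {
        cat: rng.sample(items, min(per_category, len(items)))
        for cat, items in by_category.items()
    }

def format_question(ex, with_answer):
    """Render one multiple-choice item; 'answer' is assumed to be an option index."""
    letters = "ABCD"
    lines = [f"Question: {ex['question']}"]
    lines += [f"({letters[i]}) {opt}" for i, opt in enumerate(ex["options"])]
    lines.append(f"Answer: ({letters[ex['answer']]})" if with_answer else "Answer: (")
    return "\n".join(lines)

def build_prompt(target, shots):
    """Zero-shot when `shots` is empty, few-shot otherwise."""
    parts = [format_question(s, with_answer=True) for s in shots]
    parts.append(format_question(target, with_answer=False))
    return "\n\n".join(parts)
```

Passing the five examples drawn from the target question's own category gives a five-shot prompt; passing an empty list gives the zero-shot variant, so both paradigms share one code path.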

2

SafetyBench | Benchmark | 61/100

via “multilingual safety evaluation dataset with category-stratified sampling”

11K safety evaluation questions across 7 categories.

Unique: Provides parallel Chinese-English safety evaluation with 7-category stratification and category-balanced few-shot examples (5 per category), enabling contrastive safety analysis across languages and fine-grained failure mode diagnosis. Most safety benchmarks (e.g., TruthfulQA, HarmBench) focus on English only or lack structured category decomposition.

vs others: Covers both Chinese and English with an identical category structure, enabling cross-lingual safety parity checks that general-purpose benchmarks like MMLU cannot provide; the category-stratified design shows which safety domains a model struggles with instead of reporting only an aggregate safety score.
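One way to turn the shared category structure into a per-category diagnosis is sketched below. It assumes each graded answer is available as a `(category, is_correct)` pair per language; the function names and the category labels in the usage note are illustrative, not taken from the benchmark.

```python
from collections import defaultdict

def accuracy_by_category(records):
    """records: iterable of (category, is_correct) pairs for one language."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for category, is_correct in records:
        total[category] += 1
        correct[category] += int(is_correct)
    return {c: correct[c] / total[c] for c in total}

def parity_gaps(acc_en, acc_zh):
    """English minus Chinese accuracy, for categories scored in both languages."""
    shared = acc_en.keys() & acc_zh.keys()
    return {c: acc_en[c] - acc_zh[c] for c in sorted(shared)}
```

For example, `parity_gaps(accuracy_by_category(en_records), accuracy_by_category(zh_records))` yields one signed gap per category, where a large positive value for a hypothetical `"privacy"` category would indicate the model is weaker on the Chinese version of those questions.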

3

CulturaX | Dataset | 60/100

via “language-stratified-dataset-composition”

6.3T token multilingual dataset across 167 languages.

Unique: Explicitly exposes language-level composition metadata and enables stratified sampling, whereas mC4 and OSCAR provide language labels but no built-in tools for rebalancing. CulturaX treats language distribution as a first-class concern, so practitioners can deliberately design inclusive training distributions.

vs others: Supports fairer multilingual models than training on raw web distributions (which are roughly 50% English), and is more transparent than datasets that hide language composition, so teams can audit and justify their language representation choices.
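As a rough illustration of rebalancing from per-language metadata, the sketch below applies exponent smoothing, a common heuristic in multilingual training, to per-language token counts. It is not CulturaX's own tooling: the function name `rebalanced_weights`, the `alpha` default, and the input format (a mapping from language code to token count) are all assumptions.

```python
def rebalanced_weights(token_counts, alpha=0.3):
    """Return sampling probabilities proportional to count ** alpha.

    alpha=1.0 reproduces the raw distribution; alpha=0.0 samples
    every language uniformly. Languages with zero tokens are dropped.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    scaled = {
        lang: count ** alpha
        for lang, count in token_counts.items()
        if count > 0
    }
    norm = sum(scaled.values())
    return {lang: value / norm for lang, value in scaled.items()}
```

Lower values of `alpha` shift sampling mass toward smaller languages, at the cost of repeating their data more often.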

4

UltraChat 200K | Dataset | 58/100

via “category-stratified dialogue sampling for balanced training”

200K high-quality multi-turn dialogues for instruction tuning.

Unique: Explicitly structures the dataset into three semantic categories (world knowledge, creative, task assistance) and maintains that stratification during curation, rather than treating all conversations as undifferentiated. This enables category-aware training strategies and helps prevent single-domain overfitting.

vs others: More structured than generic conversation datasets (e.g., raw Reddit or web scrapes) because category labels enable curriculum learning; more flexible than single-domain datasets because it covers multiple dialogue types in one corpus

5

OpenAssistant Conversations (OASST) | Dataset | 58/100

via “multilingual conversation dataset with 35 language support and cross-lingual sampling”

161K human-written messages in 35 languages with quality ratings.

Unique: Covers 35 languages including low-resource ones (Swahili, Vietnamese, Polish) with human-written conversations, not machine-translated. Enables genuine cross-lingual preference learning rather than synthetic translation.

vs others: Broader language coverage than English-centric datasets (e.g., ShareGPT, HH-RLHF), though with language imbalance requiring careful sampling. Larger low-resource language component than most instruction datasets.
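The "careful sampling" mentioned for the language imbalance could take several forms; one simple option is to cap how many messages each language contributes. The sketch below assumes each message is a dict with a `lang` field; `cap_per_language` and `max_per_lang` are hypothetical names, and the cap value is a tuning choice rather than anything the dataset prescribes.

```python
import random
from collections import defaultdict

def cap_per_language(messages, max_per_lang, seed=0):
    """Downsample over-represented languages to at most `max_per_lang` messages."""
    rng = random.Random(seed)
    by_lang = defaultdict(list)
    for msg in messages:
        by_lang[msg["lang"]].append(msg)
    sampled = []
    for lang in sorted(by_lang):  # sorted for a deterministic output order
        items = by_lang[lang]
        if len(items) > max_per_lang:
            items = rng.sample(items, max_per_lang)
        sampled.extend(items)
    return sampled
```

Languages below the cap are kept in full, so low-resource data is never discarded; only the dominant languages are thinned.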

6

ShieldGemma | Model | 58/100

via “multi-language-safety-classification”

Google's safety content classifiers built on Gemma.

Unique: Gemma's multilingual training enables single-model deployment across 40+ languages with shared safety semantics, avoiding the need for language-specific fine-tuned models. Provides per-language confidence adjustments that reflect training data coverage.

vs others: More efficient than maintaining separate safety models per language; more consistent than language-specific classifiers because it uses shared safety semantics across languages

7

Llama Guard | Model | 57/100

via “multilingual safety classification with machine-translated benchmarks”

Meta's LLM safety classifier for content policy enforcement.

Unique: Llama Guard is evaluated against CyberSecEval's machine-translated multilingual benchmark datasets, providing structured coverage of safety risks across languages rather than relying on a single English-trained model applied to translated text.

vs others: More comprehensive than language-agnostic classifiers because it's explicitly tested on multilingual adversarial content, though performance gaps between languages remain due to translation quality and training data imbalance

8

c4 | Dataset | 25/100

via “language detection and multilingual corpus stratification”

Dataset by allenai. 761,810 downloads.

Unique: C4 provides explicit language detection and stratification for 100+ languages, enabling transparent per-language analysis and balanced sampling. This is more comprehensive than English-only datasets and more transparent than datasets with opaque language composition. The language metadata is included in the dataset, allowing users to audit and adjust language representation.

vs others: C4's language detection and stratification enable true multilingual training and analysis, unlike English-only datasets, while maintaining transparency about language distribution and quality that proprietary multilingual datasets lack.

9

Llama Guard 3 8B | Model | 24/100

via “multi-language safety classification with english-primary accuracy”

Llama Guard 3 is a Llama-3.1-8B pretrained model, fine-tuned for content safety classification. Similar to previous versions, it can be used to classify content in both LLM inputs (prompt classification)...

Unique: Leverages Llama 3.1's multilingual base model to extend English-optimized safety fine-tuning across 8+ languages through cross-lingual transfer, enabling single-model deployment for global moderation without language-specific retraining

vs others: Simpler operational model than deploying separate language-specific safety classifiers, though with accuracy tradeoffs for non-English languages compared to language-specific fine-tuned models
