Multilingual Safety Classification With Machine Translated Benchmarks

1

MTEBBenchmark65/100

via “multilingual and cross-lingual evaluation across 112+ languages”

Embedding model benchmark — 8 tasks, 112 languages, the standard for comparing embeddings.

Unique: Task metadata system stores language codes and domain information as first-class properties, enabling programmatic filtering and cross-lingual task selection. Datasets are loaded with language-aware variants, and the evaluation pipeline preserves language context through metadata propagation. This is distinct from benchmarks that treat language as a post-hoc filtering mechanism.

vs others: Covers 112+ languages with standardized task metadata vs. most embedding benchmarks (e.g., BEIR, STS) which are English-only or have limited multilingual coverage.

2

SafetyBench EvalBenchmark63/100

via “multi-category llm safety evaluation via multiple-choice questions”

11K safety evaluation questions across 7 categories.

Unique: Combines 11,435 questions across 7 safety categories with explicit Chinese-English parallel coverage and a filtered subset (test_zh_subset.json) for sensitive keyword handling, enabling systematic cross-lingual safety assessment. Uses category-stratified few-shot examples (5 per category) to support both zero-shot and five-shot evaluation paradigms within a single framework.

vs others: Larger and more category-diverse than single-domain safety benchmarks (e.g., ToxiGen for toxicity only), and explicitly supports Chinese alongside English, addressing a gap in multilingual safety evaluation infrastructure.

3

SafetyBenchBenchmark61/100

via “multilingual safety evaluation dataset with category-stratified sampling”

11K safety evaluation questions across 7 categories.

Unique: Provides parallel Chinese-English safety evaluation with 7-category stratification and category-balanced few-shot examples (5 per category), enabling contrastive safety analysis across languages and fine-grained failure mode diagnosis. Most safety benchmarks (e.g., TruthfulQA, HarmBench) focus on English only or lack structured category decomposition.

vs others: Uniquely covers both Chinese and English with identical category structure, enabling cross-lingual safety parity validation that general-purpose benchmarks like MMLU cannot provide; category-stratified design reveals which safety domains models struggle with rather than aggregate safety scores.

4

Lakera GuardAPI61/100

via “multilingual threat detection across 100+ languages”

Real-time prompt injection and LLM threat detection API.

Unique: Uses a single unified multilingual model for threat detection across 100+ languages rather than maintaining separate language-specific classifiers, reducing operational complexity and ensuring consistent threat definitions across languages. Automatically handles language detection without explicit configuration.

vs others: More scalable than language-specific detection pipelines (which require managing N models for N languages) and simpler than language detection + routing architectures, though potentially less accurate than specialized language-specific models.

5

ShieldGemmaModel58/100

via “multi-language-safety-classification”

Google's safety content classifiers built on Gemma.

Unique: Gemma's multilingual training enables single-model deployment across 40+ languages with shared safety semantics, avoiding need for language-specific fine-tuned models. Provides per-language confidence adjustments reflecting training data coverage.

vs others: More efficient than maintaining separate safety models per language; more consistent than language-specific classifiers because it uses shared safety semantics across languages

6

Llama GuardModel57/100

via “multilingual safety classification with machine-translated benchmarks”

Meta's LLM safety classifier for content policy enforcement.

Unique: Llama Guard is evaluated against CyberSecEval's machine-translated multilingual benchmark datasets, providing structured coverage of safety risks across languages rather than relying on a single English-trained model applied to translated text.

vs others: More comprehensive than language-agnostic classifiers because it's explicitly tested on multilingual adversarial content, though performance gaps between languages remain due to translation quality and training data imbalance

7

WildGuardDataset57/100

via “evaluation benchmark for safety classifier performance”

Allen AI's safety classification dataset and model.

Unique: Provides multi-dimensional evaluation across 13 harm categories with per-category metrics rather than a single aggregate score, enabling fine-grained analysis of safety classifier performance and identification of specific weaknesses

vs others: More comprehensive than simple accuracy metrics because it includes precision, recall, and ROC-AUC; more actionable than generic benchmarks because it's specific to safety classification and includes category-level breakdowns

8

Prompt GuardModel57/100

via “multilingual prompt injection detection with machine-translated adversarial datasets”

Meta's prompt injection and jailbreak detection classifier.

Unique: Leverages CyberSecEval's multilingual dataset (mitre_prompts_multilingual_machine_translated.json) to provide single-model multilingual detection rather than language-specific classifiers, reducing deployment complexity while acknowledging translation-based limitations

vs others: Single unified model for multiple languages versus maintaining separate classifiers per language; trades off native-speaker accuracy for operational simplicity and consistency

9

MAP-NeoRepository56/100

via “bilingual model evaluation on language-specific benchmarks”

Fully open bilingual model with transparent training.

Unique: Provides integrated bilingual evaluation with language-specific analysis and cross-lingual transfer measurement, whereas most LLM projects evaluate only on English benchmarks or treat languages as separate evaluation tasks

vs others: More comprehensive and language-aware than monolingual evaluation frameworks, and more integrated than standalone multilingual benchmarks by providing bilingual-specific analysis within the training pipeline

10

bge-m3-zeroshot-v2.0Model42/100

via “language-agnostic content moderation”

zero-shot-classification model by undefined. 56,557 downloads.

Unique: Applies zero-shot classification to content moderation across 111 languages simultaneously using a single model, eliminating the need for language-specific rule sets or separate moderation classifiers, and enabling policy category changes without retraining

vs others: Faster to deploy than fine-tuned moderation models and adapts to new violation categories without retraining, though less accurate than supervised classifiers on high-stakes violations; suitable for first-pass filtering rather than final moderation decisions

11

Llama 3.3 (70B)Model25/100

via “multilingual text generation with language-specific safety thresholds”

Meta's latest Llama 3.3 model — advanced reasoning and instruction-following

Unique: Explicitly documents language-specific safety thresholds and discourages unsupported language use without fine-tuning, unlike competitors that silently degrade or provide no guidance on multilingual safety

vs others: More transparent about multilingual limitations than closed-source models, but narrower language support (8 vs 100+) and requires custom fine-tuning for expansion

12

Llama Guard 3 8BModel24/100

via “multi-language safety classification with english-primary accuracy”

Llama Guard 3 is a Llama-3.1-8B pretrained model, fine-tuned for content safety classification. Similar to previous versions, it can be used to classify content in both LLM inputs (prompt classification)...

Unique: Leverages Llama 3.1's multilingual base model to extend English-optimized safety fine-tuning across 8+ languages through cross-lingual transfer, enabling single-model deployment for global moderation without language-specific retraining

vs others: Simpler operational model than deploying separate language-specific safety classifiers, though with accuracy tradeoffs for non-English languages compared to language-specific fine-tuned models

Top Matches

Also Known As

Company