Capability
Language Detection And Script Normalization Across 167 Languages
2 artifacts provide this capability.
Want a personalized recommendation?
Find the best match →Top Matches
via “language-detection-and-script-normalization-across-167-languages”
6.3T token multilingual dataset across 167 languages.
Unique: Applies language detection and script normalization uniformly across all 167 languages using a single model and normalization pipeline, rather than language-specific preprocessing rules that would require 167 separate implementations
vs others: More robust than mC4/OSCAR's language detection by using modern neural models; more comprehensive than single-language datasets by handling script diversity (Latin, Cyrillic, Arabic, CJK, Indic) in a unified pipeline