Capability
Multilingual Language Routing via mBART Tokenizer
2 artifacts provide this capability.
Top Matches
via “multilingual tokenization with mBERT's shared vocabulary”
token-classification model. 284,856 downloads.
Unique: Uses mBERT's shared WordPiece vocabulary (about 119K tokens) across 104 languages, so text is tokenized the same way without a prior language-detection step; WordPiece subword segmentation keeps some morphological structure (stems and affixes) visible across language families (e.g., Germanic, Romance, Slavic)
vs others: Simpler than a pipeline of language-specific tokenizers while keeping reasonable compression; segments text more consistently across languages than separate tokenizers do, which reduces entity-boundary misalignment
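
To make the shared-vocabulary idea concrete, here is a minimal sketch of greedy longest-match WordPiece segmentation over a single vocabulary used for every language. The vocabulary below is a hypothetical toy set (the real mBERT vocabulary has roughly 119K entries), and `SHARED_VOCAB`, `wordpiece`, and `tokenize` are illustrative names, not part of any library. It also splits on whitespace only, whereas real BERT tokenizers normalize text and split punctuation first.

```python
from typing import List, Set

# Hypothetical toy vocabulary shared by all languages (assumption for
# illustration only). "##" marks a piece that continues a word.
SHARED_VOCAB: Set[str] = {
    "[UNK]",
    "play", "##ing", "##ed",      # English
    "spiel", "##en", "##t",       # German
    "jug", "##ar", "##ando",      # Spanish
    "the", "das", "el",
}


def wordpiece(word: str, vocab: Set[str] = SHARED_VOCAB,
              unk: str = "[UNK]") -> List[str]:
    """Segment one word by greedy longest-match-first lookup."""
    pieces: List[str] = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # word-internal piece
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No piece fits: the whole word maps to the unknown token.
            return [unk]
        pieces.append(match)
        start = end
    return pieces


def tokenize(text: str) -> List[str]:
    """Tokenize whitespace-separated text; no language tag is needed."""
    tokens: List[str] = []
    for word in text.split():
        tokens.extend(wordpiece(word))
    return tokens


if __name__ == "__main__":
    # Mixed-language input goes through the same code path.
    print(tokenize("playing spielen jugando"))
```

Because every language draws on the same piece inventory, mixed-language input needs no routing step before tokenization; the trade-off is that words with poor vocabulary coverage split into more pieces or fall back to `[UNK]`.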