Capability
Tokenization with Byte-Pair Encoding (BPE) and Shared Vocabulary
20 artifacts provide this capability.
Top Matches
via “bpe tokenization with 50k vocabulary”
Text-generation model. 14,205,413 downloads.
Unique: Standard BPE implementation with a 50K vocabulary learned from diverse internet text, providing better coverage for code and technical writing than earlier GPT models, but less optimized for non-English languages.
vs others: Simpler and faster than SentencePiece (used by T5/mBART) for English text, but less effective for multilingual tasks. GPT-3 reuses this same 50K BPE vocabulary rather than a separate tokenizer.
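To make the mechanism behind this capability concrete, here is a minimal sketch of how BPE learns its merge table: start from characters, repeatedly count adjacent symbol pairs across a frequency-weighted corpus, and merge the most frequent pair. The toy corpus and the function names (`learn_bpe`, `merge_pair`) are illustrative assumptions, not the artifact's actual implementation, which would run for roughly 50,000 merges instead of 3.

```python
from collections import Counter

def get_pair_counts(words):
    # Count adjacent symbol pairs across the corpus, weighted by word frequency.
    pairs = Counter()
    for symbols, freq in words:
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(words, pair):
    # Replace every occurrence of `pair` with its concatenation.
    merged = pair[0] + pair[1]
    out = []
    for symbols, freq in words:
        new_syms = []
        i = 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
                new_syms.append(merged)
                i += 2
            else:
                new_syms.append(symbols[i])
                i += 1
        out.append((new_syms, freq))
    return out

def learn_bpe(corpus, num_merges):
    # corpus: mapping word -> frequency; start from individual characters.
    words = [(list(w), f) for w, f in corpus.items()]
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(words)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent pair wins
        words = merge_pair(words, best)
        merges.append(best)
    return merges

# Toy corpus: three related words; 3 merges instead of the real ~50K.
merges = learn_bpe({"low": 5, "lower": 2, "lowest": 2}, num_merges=3)
print(merges)  # first merges capture the shared "low" stem
```

At inference time, these learned merges are replayed in order on new text, which is why a vocabulary trained on internet text that includes code also segments technical writing into fewer, more meaningful tokens.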