Capability
Code Feature Extraction And Token Classification Dataset
3 artifacts provide this capability.
via “multi-language code representation and tokenization”
A 250GB curated code dataset used for StarCoder training.
Unique: Explicitly supports 86 languages with language-aware metadata, enabling models to learn language-specific syntax and patterns. Preserves raw code rather than pre-tokenizing, allowing flexible tokenizer choices downstream.
vs others: Broader language coverage than CodeSearchNet (14 languages) and more flexible than pre-tokenized corpora such as the one used to train Codex, enabling researchers to experiment with different tokenization strategies and language-specific fine-tuning.
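Because the dataset preserves raw source text rather than pre-tokenized sequences, different tokenization strategies can be tried downstream without re-exporting the data. A minimal sketch of that flexibility, using illustrative field names (`language`, `content`) and two simple stdlib tokenizers as stand-ins for real choices like BPE:

```python
import re

# Hypothetical raw dataset record with language-aware metadata;
# the field names here are assumptions for illustration.
record = {
    "language": "python",
    "content": "def add(a, b):\n    return a + b\n",
}

def whitespace_tokenize(code: str) -> list[str]:
    """Naive whitespace split: cheap, but merges punctuation into tokens."""
    return code.split()

def symbol_aware_tokenize(code: str) -> list[str]:
    """Regex tokenizer separating identifiers, numbers, and single symbols."""
    return re.findall(r"[A-Za-z_]\w*|\d+|[^\sA-Za-z_\d]", code)

# Either strategy (or a learned subword tokenizer) applies to the same
# raw text, since no tokenization was baked in at dataset-creation time.
ws_tokens = whitespace_tokenize(record["content"])
sym_tokens = symbol_aware_tokenize(record["content"])
```

Here `whitespace_tokenize` yields coarse tokens like `add(a,`, while `symbol_aware_tokenize` splits out `add`, `(`, `a`, and `,` separately; which is better depends on the downstream token-classification task.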