Capability
3 artifacts provide this capability.
via “tag classification for code understanding and categorization”
Multilingual code evaluation across 17 languages.
Unique: Treats code understanding as a multi-label classification task with semantic tags, providing a structured way to evaluate whether models understand code semantics beyond syntax. Includes tag examples across all 17 languages, enabling cross-language semantic understanding evaluation.
vs others: More structured than open-ended code understanding tasks because it uses predefined semantic tags, and covers more languages (17 vs typically 1-2) than existing code classification benchmarks.
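One way to score a multi-label tagging task like the one described above is micro-averaged F1 over predicted and reference tag sets. The following is a minimal sketch only: the tag names and predictions are hypothetical, and the source does not say which metric the benchmark itself uses.

```python
from typing import List, Set


def micro_f1(gold: List[Set[str]], pred: List[Set[str]]) -> float:
    """Micro-averaged F1 over per-sample tag sets."""
    # Count tag-level hits and misses across all samples.
    tp = sum(len(g & p) for g, p in zip(gold, pred))
    fp = sum(len(p - g) for g, p in zip(gold, pred))
    fn = sum(len(g - p) for g, p in zip(gold, pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical reference and predicted tags, for illustration only.
gold_tags = [{"sorting", "greedy"}, {"graphs"}, {"math", "dp"}]
pred_tags = [{"sorting"}, {"graphs", "dp"}, {"math", "dp"}]

score = micro_f1(gold_tags, pred_tags)
print(f"micro-F1: {score:.3f}")
```

Because every sample is compared against a fixed tag vocabulary, the same scoring function can be applied unchanged to samples from any of the covered languages.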
via “multi-language code representation and tokenization”
250GB curated code dataset for StarCoder training.
Unique: Explicitly supports 86 languages with language-aware metadata, enabling models to learn language-specific syntax and patterns. Preserves raw code rather than pre-tokenizing, allowing flexible tokenizer choices downstream.
vs others: Broader language coverage than CodeSearchNet (6 languages) and more flexible than pre-tokenized datasets, enabling researchers to experiment with different tokenization strategies and language-specific fine-tuning.
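To illustrate why keeping raw code matters, the sketch below runs two interchangeable tokenizers over the same records. The record layout (`lang` and `content` fields) and both tokenizers are assumptions made for this example, not the dataset's actual schema or a recommended tokenizer.

```python
import re
from typing import Callable, Dict, List

# Hypothetical records: raw source text plus a language field.
records = [
    {"lang": "python", "content": "def add(a, b):\n    return a + b\n"},
    {"lang": "c", "content": "int add(int a, int b) { return a + b; }\n"},
]


def whitespace_tokenize(code: str) -> List[str]:
    """Split on runs of whitespace only."""
    return code.split()


def lexical_tokenize(code: str) -> List[str]:
    """Split into identifiers, integer literals, and single symbols."""
    return re.findall(r"[A-Za-z_]\w*|\d+|\S", code)


def token_counts(tokenize: Callable[[str], List[str]]) -> Dict[str, int]:
    """Count tokens per language under the given tokenizer."""
    return {r["lang"]: len(tokenize(r["content"])) for r in records}


ws_counts = token_counts(whitespace_tokenize)
lex_counts = token_counts(lexical_tokenize)
print(ws_counts, lex_counts)
```

With pre-tokenized data this comparison would not be possible, since the original character sequence needed by the second tokenizer would already be lost.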
Dataset by NTU-NLP-sg. 665,024 downloads.
Unique: Provides token-level semantic annotations across multiple programming languages, enabling training of language-agnostic code understanding models through structured prediction. Most prior datasets focus on code-level classification rather than fine-grained token-level semantics.
vs others: More fine-grained than CodeSearchNet and more multilingual than single-language token classification datasets, enabling training of robust code analyzers across language families.
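Token-level annotation means each token carries its own label, unlike the single label per snippet used in code-level classification. A minimal sketch of that shape follows, with a per-token accuracy check. The label names here are hypothetical and are not taken from the dataset.

```python
from typing import List, Tuple

# Hypothetical example: one label per token (label names are illustrative).
tokens = ["def", "area", "(", "r", ")", ":"]
gold = ["KEYWORD", "FUNC_NAME", "PUNCT", "PARAM", "PUNCT", "PUNCT"]
pred = ["KEYWORD", "FUNC_NAME", "PUNCT", "VAR", "PUNCT", "PUNCT"]


def token_accuracy(gold_labels: List[str], pred_labels: List[str]) -> float:
    """Fraction of tokens whose predicted label matches the reference."""
    if len(gold_labels) != len(pred_labels):
        raise ValueError("label sequences must have the same length")
    if not gold_labels:
        return 0.0
    hits = sum(g == p for g, p in zip(gold_labels, pred_labels))
    return hits / len(gold_labels)


# Collect (token, reference, prediction) triples where the labels disagree.
mismatches: List[Tuple[str, str, str]] = [
    (t, g, p) for t, g, p in zip(tokens, gold, pred) if g != p
]
accuracy = token_accuracy(gold, pred)
print(f"accuracy: {accuracy:.3f}, mismatches: {mismatches}")
```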
© 2026 Unfragile. The platform for software for agents.