Capability
Language-Specific Code Filtering and Sampling
2 artifacts provide this capability.
Top Matches
via “language-specific code filtering and sampling”
250GB curated code dataset for StarCoder training.
Unique: Provides language-stratified sampling and filtering across 86 languages, enabling researchers to control dataset composition by language. Includes language distribution statistics for informed sampling decisions.
vs others: More flexible than fixed-composition datasets and broader in coverage than single-language datasets. Lets researchers study how language diversity affects code model performance.
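As a rough illustration of what language-stratified filtering and sampling can look like, here is a minimal Python sketch. The record layout (`lang` and `content` fields), the function names, and the toy corpus are assumptions made for this example; they are not the dataset's actual schema or tooling.

```python
import random
from collections import Counter, defaultdict


def language_distribution(records):
    """Return each language's share of the records (shares sum to 1)."""
    counts = Counter(r["lang"] for r in records)
    total = sum(counts.values())
    return {lang: n / total for lang, n in counts.items()}


def stratified_sample(records, weights, n, seed=0):
    """Draw roughly n records whose language mix follows `weights`.

    Languages absent from `weights` are filtered out entirely.
    A language's quota is capped at the number of records it has.
    """
    rng = random.Random(seed)

    # Filtering step: group records by language, keeping only requested ones.
    by_lang = defaultdict(list)
    for r in records:
        if r["lang"] in weights:
            by_lang[r["lang"]].append(r)

    # Sampling step: give each language a quota proportional to its weight.
    total_weight = sum(weights[lang] for lang in by_lang)
    sample = []
    for lang, items in by_lang.items():
        quota = min(len(items), round(n * weights[lang] / total_weight))
        sample.extend(rng.sample(items, quota))
    return sample


# Hypothetical toy corpus: 60 Python, 30 Java, and 10 Rust files.
corpus = (
    [{"lang": "python", "content": f"print({i})"} for i in range(60)]
    + [{"lang": "java", "content": f"int x{i};"} for i in range(30)]
    + [{"lang": "rust", "content": f"let x{i} = 0;"} for i in range(10)]
)

# Inspect the distribution first, then choose a target mix.
stats = language_distribution(corpus)

# Keep only Python and Rust, balanced 1:1, about 20 files in total.
subset = stratified_sample(corpus, {"python": 1, "rust": 1}, n=20)
```

The distribution statistics are computed first because they show which languages are scarce; a quota larger than a language's available files is silently capped here, so a real pipeline might instead warn or upsample.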