CodeSearchNet
Dataset · Free — 6 million functions across 6 languages, about 2 million of them paired with documentation.
Capabilities — 8 decomposed
multi-language code-documentation pair extraction and indexing
Medium confidence — Extracts roughly 6 million functions (about 2 million with docstrings) from public GitHub repositories across Python, Java, JavaScript, PHP, Ruby, and Go using AST parsing and heuristic matching to align code blocks with their associated natural language documentation. The dataset structures these pairs with metadata (repository, file path, function signature), enabling large-scale supervised training of code understanding models. Implementation uses language-specific parsers to identify function boundaries and documentation conventions (docstrings, JSDoc, Javadoc, etc.), with fuzzy matching to handle inconsistent documentation patterns.
Combines AST-based function extraction with heuristic docstring matching across 6 languages in a single unified dataset, enabling cross-language code understanding research. The scale (6M functions) and multi-language coverage were novel at publication (2019) and influenced the architecture of subsequent code models like CodeBERT, which used this dataset for pre-training.
Larger and more diverse than earlier code datasets (e.g., StackOverflow snippets) and includes multiple languages in one benchmark, whereas most prior work focused on single-language datasets or synthetic code-comment pairs.
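The extraction idea above can be sketched for the Python case with the standard library's `ast` module. This is a minimal single-language illustration, not the actual multi-language pipeline (which relies on per-language parsers); function and variable names here are invented:

```python
import ast

def extract_pairs(source: str):
    """Extract (function name, docstring) pairs from Python source.

    Single-language sketch of the pair-extraction idea; the real pipeline
    handles six languages with language-specific parsers and heuristics.
    """
    pairs = []
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            doc = ast.get_docstring(node)
            if doc:  # keep only documented functions, as the pair corpus does
                pairs.append((node.name, doc.strip()))
    return pairs

code = '''
def add(a, b):
    """Return the sum of a and b."""
    return a + b

def _helper(x):
    return x * 2
'''
pairs = extract_pairs(code)
print(pairs)  # [('add', 'Return the sum of a and b.')]
```

Undocumented functions like `_helper` are dropped, mirroring how only functions with aligned documentation become training pairs.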
code search benchmark with relevance ranking evaluation
Medium confidence — Provides a standardized evaluation protocol where code search systems are scored on their ability to rank relevant functions highly when given natural language queries. The benchmark combines training relevance labels derived from docstring-code alignment with expert human relevance judgments for a set of 99 natural-language queries (the CodeSearchNet Challenge), enabling metrics like Mean Reciprocal Rank (MRR), Normalized Discounted Cumulative Gain (NDCG), and recall@k. Evaluation is performed by computing similarity between query embeddings and code embeddings, then ranking functions by score and comparing against the ground-truth relevant functions.
Provides a large-scale (6M function) benchmark with standardized train/test splits and evaluation metrics specifically designed for code search, whereas prior code datasets lacked formal evaluation protocols. The benchmark directly influenced how subsequent code models (CodeBERT, GraphCodeBERT) are evaluated in academic papers.
More comprehensive and language-diverse than earlier code search benchmarks, and includes explicit relevance judgments rather than relying on proxy signals like code similarity or clone detection.
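The MRR metric mentioned above rewards systems for placing the first relevant function near the top of the ranking. A minimal sketch with made-up function ids:

```python
def mean_reciprocal_rank(ranked_ids_per_query, relevant_id_per_query):
    """MRR over queries: average of 1/rank of the first relevant result."""
    total = 0.0
    for ranked_ids, relevant in zip(ranked_ids_per_query, relevant_id_per_query):
        rr = 0.0
        for rank, candidate in enumerate(ranked_ids, start=1):
            if candidate == relevant:
                rr = 1.0 / rank  # reciprocal rank of first relevant hit
                break
        total += rr
    return total / len(relevant_id_per_query)

# Two queries: gold function ranked 1st and 2nd respectively.
mrr = mean_reciprocal_rank([["f1", "f2"], ["f9", "f3"]], ["f1", "f3"])
print(mrr)  # (1.0 + 0.5) / 2 = 0.75
```

A query whose relevant function never appears in the ranking contributes 0, so MRR penalizes both misses and low placements.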
language-specific function boundary detection and extraction
Medium confidence — Implements language-specific AST parsing and heuristic-based extraction to identify function definitions and their associated docstrings across 6 programming languages. For each language, the extraction pipeline uses language-specific conventions: Python (docstrings via triple quotes), Java (Javadoc comments), JavaScript (JSDoc), PHP (PHPDoc), Ruby (YARD/RDoc), and Go (comment blocks). The system handles edge cases like nested functions, decorators, type annotations, and multi-line signatures by leveraging language-specific syntax rules and comment parsing.
Unified extraction pipeline that handles 6 languages with language-specific docstring conventions (docstrings, Javadoc, JSDoc, PHPDoc, YARD, Go comments) in a single codebase, rather than separate language-specific tools. Uses heuristic-based alignment to match docstrings to functions without requiring explicit AST node linking.
More scalable than manual annotation and more robust than regex-based extraction because it uses proper AST parsing for function boundaries, reducing false positives and false negatives compared to string-matching approaches.
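The "doc comment immediately precedes the definition" convention at the heart of the alignment heuristic can be illustrated with a deliberately naive regex for Javadoc. This is exactly the kind of string matching the AST-based pipeline improves on, shown here only to make the convention concrete; the pattern and names are invented:

```python
import re

# Naive heuristic (not AST-based): align a Javadoc block with the method
# that immediately follows it. Proper parsers handle the edge cases this
# regex would miss (annotations, generics, nested classes, etc.).
JAVADOC_METHOD = re.compile(
    r"/\*\*(?P<doc>.*?)\*/\s*"                # the Javadoc block
    r"(?:public|private|protected)[^\{;]*?"   # modifiers and return type
    r"\b(?P<name>\w+)\s*\(",                  # method name before '('
    re.DOTALL,
)

java_src = """
/** Adds two ints. */
public int add(int a, int b) { return a + b; }
"""
m = JAVADOC_METHOD.search(java_src)
print(m.group("name"), "->", m.group("doc").strip())  # add -> Adds two ints.
```

Each of the six languages gets its own convention handler (triple-quoted docstrings, JSDoc, YARD, Go comment blocks), all feeding a common pair schema.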
pre-computed code and query embeddings for rapid model evaluation
Medium confidence — Provides pre-computed dense vector embeddings for all 6 million functions and associated queries using the benchmark's baseline encoder models, enabling researchers to evaluate new ranking or retrieval strategies without re-embedding the entire dataset. Embeddings are stored in a format suited to similarity search (e.g., vectors loadable into approximate nearest-neighbor indexes such as Annoy or FAISS), allowing fast nearest-neighbor lookup and ranking without loading the full model. This capability abstracts away the computational cost of embedding generation, making the benchmark accessible to researchers without GPU resources.
Provides pre-computed embeddings for the entire 6M-function dataset using its baseline encoder models, enabling rapid evaluation of retrieval algorithms without re-embedding. This was a novel contribution at the time (2019), as prior code datasets did not ship embeddings, forcing researchers to train embedding models from scratch.
Dramatically reduces the barrier to entry for code search research compared to starting from raw code, and enables fair comparison across methods by using a shared embedding space rather than each team using different models.
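Evaluating a retrieval strategy on pre-computed embeddings reduces to scoring and sorting vectors. A self-contained sketch with toy two-dimensional vectors standing in for the real embeddings (ids and values are made up):

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def rank_functions(query_vec, code_vecs):
    """Rank function ids by cosine similarity to the query embedding."""
    scored = [(fid, cosine(query_vec, vec)) for fid, vec in code_vecs.items()]
    return [fid for fid, _ in sorted(scored, key=lambda t: t[1], reverse=True)]

code_vecs = {"parse_json": [0.9, 0.1], "sort_list": [0.1, 0.9]}
ranking = rank_functions([1.0, 0.0], code_vecs)
print(ranking)  # ['parse_json', 'sort_list']
```

At 6M-function scale the brute-force loop would be replaced by an approximate nearest-neighbor index, but the scoring logic stays the same.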
train-test split with language-stratified sampling
Medium confidence — Provides standardized train/test/validation splits of the function-docstring pairs with stratification by programming language to ensure balanced representation across languages in each split. The split strategy maintains the distribution of languages (Python, Java, JavaScript, PHP, Ruby, Go) across train/test sets, preventing models from overfitting to language-specific patterns or achieving inflated performance on high-resource languages. Splits are deterministic and reproducible, enabling fair comparison across research papers and implementations.
Implements language-stratified sampling to ensure balanced representation of all 6 languages in train/test splits, preventing models from overfitting to high-resource languages (Python, Java) at the expense of lower-resource languages (Ruby, PHP). This design choice influenced how subsequent code benchmarks structure their splits.
More rigorous than random train/test splits because it ensures language distribution is preserved, enabling fair evaluation of multi-language models and preventing spurious performance gains from language-specific biases.
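Language-stratified splitting can be sketched in a few lines: group by language, shuffle deterministically, and cut each group at the same fraction. Field names here are illustrative, not the dataset's exact schema:

```python
import random

def stratified_split(examples, test_frac=0.1, seed=0):
    """Split examples (dicts with a 'language' key) so each language
    contributes the same train/test proportion. Deterministic via seed."""
    rng = random.Random(seed)
    by_lang = {}
    for ex in examples:
        by_lang.setdefault(ex["language"], []).append(ex)
    train, test = [], []
    for lang, group in sorted(by_lang.items()):  # sorted for reproducibility
        rng.shuffle(group)
        cut = int(len(group) * test_frac)
        test.extend(group[:cut])
        train.extend(group[cut:])
    return train, test

data = [{"language": l, "id": i}
        for i, l in enumerate(["python"] * 80 + ["ruby"] * 20)]
train, test = stratified_split(data)
print(len(train), len(test))  # 90 10
ruby_in_test = sum(ex["language"] == "ruby" for ex in test)
print(ruby_in_test)  # 2 — ruby keeps its 10% share of the test set
```

A plain random split over the same data could easily leave a low-resource language with almost no test examples; stratification guarantees each language's share survives the cut.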
github repository metadata and provenance tracking
Medium confidence — Includes rich metadata for each function-docstring pair: repository owner, repository name, file path, commit hash, and GitHub URL. This metadata enables researchers to trace extracted functions back to their original source, verify data quality, and analyze code search performance by repository characteristics (e.g., popularity, age, language). The provenance information supports reproducibility and allows researchers to filter or analyze subsets of the dataset based on repository properties (e.g., only functions from popular repositories, or only recent commits).
Includes full GitHub provenance (owner, repo, path, commit) for every function, enabling researchers to trace back to original source and verify data quality. This level of metadata was uncommon in code datasets at the time (2019) and enables reproducibility and auditing.
More transparent and auditable than datasets that strip metadata or anonymize sources, and enables researchers to analyze performance by data source characteristics rather than treating the dataset as a monolithic collection.
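Filtering by provenance is a one-liner once each example carries its repository metadata. A sketch with invented field names and example records (the real schema differs in detail):

```python
def filter_by_repo(examples, allowed_repos):
    """Keep only examples whose provenance points at an allowed repository.

    Assumes each example carries a 'repo' field, as the dataset's metadata
    does; field names here are illustrative, not the exact schema.
    """
    return [ex for ex in examples if ex["repo"] in allowed_repos]

examples = [
    {"repo": "pallets/flask", "path": "src/flask/app.py", "func": "route"},
    {"repo": "acme/toy", "path": "main.py", "func": "run"},
]
popular = {"pallets/flask"}
kept = [ex["func"] for ex in filter_by_repo(examples, popular)]
print(kept)  # ['route']
```

The same pattern extends to filtering by commit recency or path prefix, since every pair retains its full GitHub coordinates.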
multi-language code normalization and standardization
Medium confidence — Applies language-specific normalization rules to code snippets to improve consistency and reduce noise: removing comments (except docstrings), normalizing whitespace, standardizing identifier names, and handling language-specific syntax variations. The normalization is applied consistently across all 6 languages using language-specific rules (e.g., Python indentation, Java access modifiers, JavaScript semicolons), enabling models to focus on semantic patterns rather than syntactic variations. Normalization is optional and can be disabled for use cases requiring original code.
Applies language-specific normalization rules to code across 6 languages in a unified pipeline, rather than using language-agnostic normalization or no normalization at all. This enables models to learn semantic patterns while reducing syntactic noise, improving generalization across different coding styles.
More sophisticated than simple whitespace normalization because it uses language-specific rules (e.g., Python indentation, Java access modifiers) to handle language-specific syntax variations, and more practical than no normalization because it reduces noise without losing semantic information.
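The "remove comments but keep docstrings" step can be sketched for Python using the standard library's `tokenize` module, which distinguishes `#` comments from string literals. A single-language sketch only; per-language rules would differ:

```python
import io
import tokenize

def strip_comments(source: str) -> str:
    """Remove '#' comments from Python code while keeping docstrings.

    Docstrings survive because they are STRING tokens, not COMMENT tokens.
    Sketch of one normalization rule; real pipelines apply per-language
    rules and keep normalization optional.
    """
    tokens = tokenize.generate_tokens(io.StringIO(source).readline)
    kept = [tok for tok in tokens if tok.type != tokenize.COMMENT]
    return tokenize.untokenize(kept)

src = 'def f(x):\n    """Doc."""\n    return x  # identity\n'
cleaned = strip_comments(src)
print(cleaned)
```

Token-level stripping is safer than regex deletion because a `#` inside a string literal is never misclassified as a comment.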
multi-language code tokenization and vocabulary
Medium confidence — Provides language-aware tokenization and a shared vocabulary for code across 6 programming languages. Tokenization handles language-specific syntax (operators, keywords, delimiters) while creating a unified vocabulary that maps tokens from different languages to shared semantic categories. This enables models to process code from any supported language using a single tokenizer and vocabulary, reducing model complexity and enabling cross-language transfer.
Provides language-aware tokenization with a unified vocabulary across 6 languages, enabling single-model processing of multi-language code. Uses language-specific syntax rules while maintaining semantic equivalence across languages.
Offers a single shared vocabulary for 6 languages, whereas alternatives like separate language-specific tokenizers require multiple models or complex language-switching logic.
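The shared-vocabulary idea rests on splitting code into tokens, then subsplitting identifiers (snake_case, camelCase) so that `parseJSON` and `parse_json` share subword units across languages. A naive language-agnostic sketch, not the benchmark's actual tokenizer:

```python
import re

# Coarse code tokenizer: identifiers, numbers, and single punctuation chars.
TOKEN = re.compile(r"[A-Za-z_][A-Za-z0-9_]*|\d+|[^\sA-Za-z0-9_]")
# Subword splitter for identifiers: camelCase humps, ALLCAPS runs, and
# underscore-separated parts.
SUBWORD = re.compile(r"[A-Z]?[a-z0-9]+|[A-Z]+(?![a-z])|_")

def tokenize_code(src: str):
    """Tokenize code and subsplit identifiers into shared subword units."""
    tokens = []
    for tok in TOKEN.findall(src):
        if tok[0].isalpha() or tok[0] == "_":
            tokens.extend(s for s in SUBWORD.findall(tok) if s != "_")
        else:
            tokens.append(tok)
    return tokens

result = tokenize_code("parseJSON(max_len)")
print(result)  # ['parse', 'JSON', '(', 'max', 'len', ')']
```

Because the subword units are language-independent, a single vocabulary built this way covers identifiers from all six languages without per-language tokenizers.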
Capabilities are decomposed by AI analysis. Each maps to specific user intents and improves with match feedback.
Related Artifacts — sharing capabilities
Artifacts that share capabilities with CodeSearchNet, ranked by overlap. Discovered automatically through the match graph.
@13w/local-rag
Distributed semantic memory + code RAG as an MCP plugin for Claude Code agents
grepmax
Semantic code search for coding agents. Local embeddings, LLM summaries, call graph tracing.
CodeT5
Home of CodeT5: Open Code LLMs for Code Understanding and Generation
xCodeEval
Dataset by NTU-NLP-sg. 665,024 downloads.
codebasesearch
Ultra-simple code search tool with Jina embeddings, LanceDB, and MCP protocol support
Bloop apps
Best For
- ✓ ML researchers training code understanding models (CodeBERT, GraphCodeBERT, UniXcoder variants)
- ✓ Teams building production code search systems who need a standardized benchmark for evaluation
- ✓ Organizations fine-tuning pre-trained models on domain-specific code repositories
- ✓ Researchers publishing code search papers who need a reproducible benchmark
- ✓ Teams evaluating commercial or open-source code search tools (GitHub Copilot, Tabnine, Kite, etc.)
- ✓ ML engineers tuning retrieval hyperparameters (embedding dimension, similarity metric, re-ranking strategy)
- ✓ Data engineers building training datasets for code understanding models
- ✓ Researchers studying code documentation patterns across programming languages
Known Limitations
- ⚠ Dataset is a static snapshot from GitHub circa 2019 — does not reflect modern code patterns, frameworks, or language versions
- ⚠ Docstring quality varies significantly; many functions have minimal or auto-generated documentation, introducing noise for training
- ⚠ Extraction heuristics may misalign function boundaries with docstrings in edge cases (nested functions, decorators, complex inheritance)
- ⚠ Skewed language distribution — Python and Java dominate; Ruby and PHP are underrepresented relative to real-world usage
- ⚠ No temporal information — cannot track how code and documentation evolved or diverged over time
- ⚠ Extraction process does not capture context beyond individual functions (imports, class definitions, module-level state)
UnfragileRank
UnfragileRank is computed from adoption signals, documentation quality, ecosystem connectivity, match graph feedback, and freshness. No artifact can pay for a higher rank.
About
GitHub and Microsoft Research's benchmark dataset for code search containing 6 million functions across 6 programming languages (Python, Java, JavaScript, PHP, Ruby, Go), about 2 million of them paired with natural language documentation. Both a dataset for training code understanding models and a benchmark for evaluating code search systems. Functions are extracted from public GitHub repositories with their associated docstrings. Influenced the development of CodeBERT, GraphCodeBERT, and subsequent code understanding models.
Categories
Alternatives to CodeSearchNet
Are you the builder of CodeSearchNet?
Claim this artifact to get a verified badge, access match analytics, see which intents users search for, and manage your listing.