xCodeEval

Q: What can xCodeEval do?

multilingual code-to-code translation dataset construction, code clone detection dataset with multilingual support, code search and retrieval dataset with natural language queries, code question-answering dataset with multilingual code context, code feature extraction and token classification dataset, multilingual code representation learning through contrastive pairs, code-to-text generation dataset for documentation and explanation

Dataset, Free

Dataset by NTU-NLP-sg. 696,087 downloads.

Open Source


7 capabilities

Capabilities (7 decomposed)

multilingual code-to-code translation dataset construction

Medium confidence

Provides expert-annotated code translation pairs across multiple programming languages, so that models can be trained to translate code between languages while preserving functionality. Expert-generated annotations support translation quality, and each example includes the source code and its target translation with language-pair coverage, allowing models to learn cross-language code semantics through supervised learning on diverse programming paradigms.

Solves for

Train code translation models that convert legacy code from one language to another while maintaining functionality

Build multilingual code understanding systems that work across Java, Python, C++, JavaScript, and other languages

Evaluate code-to-code translation quality and semantic preservation across language boundaries

Create benchmarks for assessing how well LLMs understand code equivalence across programming languages

Best for

ML researchers training code translation models

Teams building cross-language code migration tools

Developers evaluating multilingual code LLM performance

Requires

HuggingFace Datasets library (datasets package)

Python 3.7+

Sufficient disk space for 1M-10M size category dataset (~2-5GB estimated)

Limitations

Expert annotations may reflect specific translation preferences and idioms, not all valid translations

Dataset size (1M-10M size category) may be insufficient for training very large models on all language pairs equally

Language pair coverage is uneven — some language combinations may have significantly fewer examples than others
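The uneven language-pair coverage noted above can be measured before training. The following is a minimal sketch, assuming each record carries hypothetical `source_lang` and `target_lang` fields; these field names and the toy records are illustrative and are not the dataset's confirmed schema.

```python
from collections import Counter

# Toy records standing in for translation examples. The field names
# `source_lang` and `target_lang` are assumptions made for illustration.
records = [
    {"source_lang": "Java", "target_lang": "Python"},
    {"source_lang": "Java", "target_lang": "Python"},
    {"source_lang": "C++", "target_lang": "Python"},
    {"source_lang": "Java", "target_lang": "C++"},
    {"source_lang": "Java", "target_lang": "Python"},
]

def pair_coverage(rows):
    """Count examples per (source, target) language pair."""
    return Counter((r["source_lang"], r["target_lang"]) for r in rows)

def imbalance_ratio(coverage):
    """Ratio of the largest pair count to the smallest pair count."""
    counts = list(coverage.values())
    return max(counts) / min(counts)

coverage = pair_coverage(records)
for pair, n in coverage.most_common():
    print(pair, n)
print("imbalance ratio:", imbalance_ratio(coverage))
```

A high ratio suggests that rare pairs may need upsampling or separate evaluation, depending on the training setup.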

What makes it unique

Combines expert-generated annotations with found code sources to create translation pairs across 6+ programming languages, using token-classification and text-retrieval task formulations to enable both fine-grained alignment learning and semantic matching, a combination of scale and diversity not matched by earlier code translation datasets

vs alternatives

Larger and more diverse than CodeXGLUE's translation subset and includes expert validation of translation quality, whereas most prior datasets rely on automated alignment or single-language-pair focus

code clone detection dataset with multilingual support

Medium confidence

Provides annotated pairs of semantically equivalent code snippets across multiple programming languages, enabling training of models to detect code clones and semantic similarity. The dataset uses expert classification to identify true semantic equivalence versus syntactic similarity, allowing models to learn language-agnostic code representations through contrastive or classification-based approaches on code pairs with varying levels of structural and semantic overlap.
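The distinction between semantic equivalence and syntactic similarity can be illustrated with a purely syntactic baseline. This sketch uses token-set Jaccard similarity, a stand-in chosen for illustration and not a method the dataset prescribes; the snippets are invented.

```python
import re

def tokens(code):
    """Split code into identifier, number, and single-symbol tokens."""
    return set(re.findall(r"[A-Za-z_]\w*|\d+|\S", code))

def jaccard(a, b):
    """Jaccard similarity between the token sets of two snippets."""
    ta, tb = tokens(a), tokens(b)
    return len(ta & tb) / len(ta | tb)

loop_sum = "total = 0\nfor x in xs:\n    total += x"
builtin_sum = "total = sum(xs)"
loop_product = "total = 1\nfor x in xs:\n    total *= x"

# Same behavior, different surface form.
sim_equivalent = jaccard(loop_sum, builtin_sum)
# Different behavior, nearly identical surface form.
sim_different = jaccard(loop_sum, loop_product)

print(round(sim_equivalent, 2), round(sim_different, 2))
```

The syntactic score ranks the non-equivalent pair higher than the equivalent one, which is the kind of failure that labeled semantic-equivalence pairs are meant to expose.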

Solves for

Train code clone detection models that identify semantically equivalent code across language boundaries

Build code similarity search systems that find functionally equivalent implementations regardless of language

Evaluate how well code embeddings capture semantic equivalence across programming paradigms

Create benchmarks for assessing code deduplication and plagiarism detection systems

Best for

Security researchers building code plagiarism detection systems

ML engineers training code embedding models

Teams managing large polyglot codebases needing deduplication

Requires

HuggingFace Datasets library

Python 3.7+

Code parsing/tokenization tools for target languages

Limitations

Expert annotations reflect human judgment of equivalence, which may not align with all valid interpretations of 'semantic equivalence'

Clone detection focuses on function/method-level granularity; may not capture equivalence at statement or expression level

Dataset may have annotation bias toward certain programming styles or idioms common in the expert annotators' experience

What makes it unique

Combines cross-language code pairs with expert-validated semantic equivalence labels, enabling training of language-agnostic clone detectors through token-classification and text-retrieval formulations — most prior clone detection datasets focus on single-language or syntactic similarity

vs alternatives

Provides multilingual clone pairs with expert validation, whereas BigCloneBench focuses on Java-only clones and POJ-104 is limited to C/C++ programs grouped by the problem they solve

code search and retrieval dataset with natural language queries

Medium confidence

Provides paired code snippets and natural language descriptions/queries, enabling training of code search models that retrieve relevant code given natural language intent. The dataset uses expert-generated descriptions and found code to create query-code pairs, allowing models to learn the mapping between natural language semantics and code implementation through text-retrieval and feature-extraction tasks on multilingual code.
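Retrieval quality on query-code pairs is often summarized with mean reciprocal rank (MRR). The sketch below uses a bag-of-words cosine ranker as a placeholder for a learned embedding model; the corpus, queries, and function names are invented for illustration.

```python
import math
import re
from collections import Counter

def bow(text):
    """Lowercased bag-of-words vector; text is split on non-letters."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[t] * v[t] for t in u)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def rank(query, corpus):
    """Return corpus indices sorted by descending similarity to the query."""
    q = bow(query)
    scores = [cosine(q, bow(doc)) for doc in corpus]
    return sorted(range(len(corpus)), key=lambda i: scores[i], reverse=True)

def mean_reciprocal_rank(queries, gold, corpus):
    """Average of 1 / (rank position of the gold snippet) over all queries."""
    total = 0.0
    for query, gold_idx in zip(queries, gold):
        total += 1.0 / (rank(query, corpus).index(gold_idx) + 1)
    return total / len(queries)

corpus = [
    "def reverse_string(s): return s[::-1]",
    "def max_value(items): return max(items)",
    "def read_file(path): return open(path).read()",
]
queries = ["reverse a string", "find the max value in a list", "read a file from a path"]
gold = [0, 1, 2]  # index of the correct snippet for each query

mrr = mean_reciprocal_rank(queries, gold, corpus)
print(mrr)
```

Keyword overlap is enough for this toy corpus; queries that share no vocabulary with the target code are where learned code and text embeddings become necessary.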

Solves for

Train code search engines that find relevant implementations given natural language problem descriptions

Build semantic code search systems that understand intent beyond keyword matching

Evaluate how well code-language models understand the relationship between documentation and implementation

Create benchmarks for assessing code retrieval quality in IDE plugins and code recommendation systems

Best for

Teams building code search and recommendation features in IDEs

ML researchers training code-language alignment models

Organizations implementing internal code discovery systems

Requires

HuggingFace Datasets library

Python 3.7+

Text embedding and code embedding models (e.g., CodeBERT, GraphCodeBERT)

Limitations

Natural language descriptions may be incomplete or ambiguous, not capturing all nuances of code behavior

Query-code pairing may reflect specific documentation styles and may not generalize to all code description patterns

Dataset size may be insufficient for training very large retrieval models with many language pairs

What makes it unique

Combines expert-generated natural language descriptions with found code across multiple languages, using text-retrieval formulations to enable training of semantic code search models — integrates both code-to-code and code-to-language alignment in a single dataset

vs alternatives

Larger and more multilingual than CodeSearchNet and includes expert-validated descriptions, whereas CodeSearchNet relies on mined documentation and focuses primarily on English

code question-answering dataset with multilingual code context

Medium confidence

Provides code snippets paired with natural language questions and expert-generated answers about code behavior, enabling training of models to answer questions about code functionality and semantics. The dataset uses question-answering and text-generation task formulations to train models to understand code and generate natural language explanations, supporting both extractive and abstractive answer generation across multiple programming languages.

Solves for

Train code understanding models that answer questions about what code does and why

Build AI assistants that explain code behavior and help developers understand unfamiliar implementations

Evaluate how well code LLMs understand code semantics and can articulate behavior in natural language

Create benchmarks for code explanation and documentation generation systems

Best for

Teams building code explanation features in IDEs and documentation tools

ML researchers training code understanding and generation models

Organizations creating AI-powered code review and documentation systems

Requires

HuggingFace Datasets library

Python 3.7+

Code parsing and AST analysis tools for target languages

Limitations

Expert answers may reflect specific interpretation of code behavior and may not capture all valid explanations

Questions may be biased toward certain types of code understanding (e.g., control flow vs. data flow)

Answer quality depends on expert annotator expertise in each programming language

What makes it unique

Combines code snippets with expert-generated question-answer pairs across multiple languages, enabling training of code understanding models through both extractive and abstractive QA formulations — integrates code comprehension with natural language generation in a multilingual context

vs alternatives

Applies question answering to code rather than prose, unlike CoQA (conversational QA on text), and is more multilingual than CodeQA, which focuses primarily on Java and Python

code feature extraction and token classification dataset

Medium confidence

Provides code snippets with expert-generated token-level annotations for semantic features (e.g., variable scope, function calls, data flow), enabling training of models to identify and classify code elements. The dataset uses token-classification and feature-extraction task formulations to train models to understand fine-grained code structure and semantics, supporting both sequence labeling and structured prediction approaches on multilingual code.
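Sequence labeling over code tokens is commonly encoded with BIO tags. This sketch converts span annotations into per-token tags; the label names (`VAR`, `CALL`) and the span format are hypothetical and not taken from the dataset.

```python
def spans_to_bio(tokens, spans):
    """Convert (start, end_exclusive, label) token spans into BIO tags.

    Spans are assumed to be non-overlapping token index ranges.
    """
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        tags[start] = f"B-{label}"       # first token of the span
        for i in range(start + 1, end):  # remaining tokens of the span
            tags[i] = f"I-{label}"
    return tags

tokens = ["total", "=", "math", ".", "sqrt", "(", "x", ")"]
# Hypothetical annotations: two variables and one qualified function call.
spans = [(0, 1, "VAR"), (2, 5, "CALL"), (6, 7, "VAR")]

tags = spans_to_bio(tokens, spans)
for token, tag in zip(tokens, tags):
    print(f"{token}\t{tag}")
```

The resulting tag sequence has one label per token, which is the input shape that token-classification models expect.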

Solves for

Train code understanding models that identify semantic elements like variable definitions, function calls, and data dependencies

Build code analysis tools that extract structured information from code for refactoring and optimization

Evaluate how well code models understand code structure and semantics at the token level

Create benchmarks for code parsing and semantic analysis across programming languages

Best for

Teams building code analysis and refactoring tools

ML researchers training code understanding models with fine-grained annotations

Organizations implementing code quality and security analysis systems

Requires

HuggingFace Datasets library

Python 3.7+

Tokenization tools for target programming languages

Limitations

Token-level annotations require consistent annotation guidelines across languages, which may introduce bias

Annotation granularity may not capture all relevant semantic features or may over-segment code

Different programming languages have different token structures, making cross-language learning challenging

What makes it unique

Provides token-level semantic annotations across multiple programming languages, enabling training of language-agnostic code understanding models through structured prediction — most prior datasets focus on code-level classification rather than fine-grained token-level semantics

vs alternatives

More fine-grained than CodeSearchNet and more multilingual than single-language token classification datasets, enabling training of robust code analyzers across language families

multilingual code representation learning through contrastive pairs

Medium confidence

Provides code pairs with varying degrees of semantic and syntactic similarity across multiple programming languages, enabling training of code embedding models through contrastive learning approaches. The dataset uses both positive pairs (semantically equivalent code) and negative pairs (dissimilar code) to train models to learn language-agnostic code representations that capture semantic similarity while being invariant to syntactic variation and language choice.
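One common way to train on positive and negative pairs is a triplet margin loss, which pushes an anchor closer to its positive than to its negative by at least a margin. The sketch below uses made-up 2-D embeddings and plain Python; it illustrates the objective only and is not the training recipe used for this dataset.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def triplet_loss(anchor, positive, negative, margin=1.0):
    """max(0, d(anchor, positive) - d(anchor, negative) + margin)."""
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)

anchor = [0.0, 0.0]  # e.g. embedding of a Python implementation

# Poor embedding: the equivalent snippet is far (d = 5), the unrelated one near (d = 2).
loss_poor = triplet_loss(anchor, positive=[3.0, 4.0], negative=[0.0, 2.0])

# Good embedding: the equivalent snippet is near (d = 1), the unrelated one far (d = 5).
loss_good = triplet_loss(anchor, positive=[0.0, 1.0], negative=[3.0, 4.0])

print(loss_poor, loss_good)
```

The loss is zero once the negative is farther than the positive by at least the margin, so only violating triplets contribute gradient during training.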

Solves for

Train code embedding models that produce similar representations for semantically equivalent code across languages

Build code similarity and clustering systems that work across programming language boundaries

Evaluate how well code embeddings capture semantic equivalence independent of language and syntax

Create benchmarks for assessing code representation quality on multilingual code understanding tasks

Best for

ML researchers training code embedding and representation learning models

Teams building code clustering and deduplication systems

Organizations implementing semantic code search across polyglot codebases

Requires

HuggingFace Datasets library

Python 3.7+

Contrastive learning frameworks (e.g., SimCLR, Triplet Loss implementations)

Limitations

Contrastive pair selection may introduce bias toward certain types of equivalence and miss valid alternatives

Negative pair selection strategy significantly impacts learning quality but may not be optimal for all downstream tasks

Embedding quality depends on model architecture and training procedure, not just dataset quality

What makes it unique

Provides expert-validated positive and negative code pairs across multiple languages for contrastive learning, enabling training of language-agnostic code embeddings that capture semantic equivalence, and combines scale with multilingual diversity and expert validation

vs alternatives

Larger and more diverse than CodeSearchNet's contrastive pairs and includes explicit negative examples, whereas most prior datasets rely on mined or automatically-aligned pairs without expert validation

code-to-text generation dataset for documentation and explanation

Medium confidence

Provides code snippets paired with expert-generated natural language descriptions and documentation, enabling training of models to generate documentation and explanations from code. The dataset uses text-generation task formulations to train models to understand code semantics and produce coherent, accurate natural language descriptions, supporting both abstractive summarization and detailed explanation generation across multiple programming languages.
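Generated descriptions are usually scored against the reference description with an overlap metric. The sketch below computes a unigram F1 score on whitespace tokens, a simplified stand-in for metrics such as ROUGE or BLEU; the example sentences are invented.

```python
from collections import Counter

def unigram_f1(candidate, reference):
    """F1 over overlapping lowercase whitespace tokens (with multiplicity)."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # clipped token matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

reference = "returns the sum of all items in the list"
candidate = "returns the sum of the list"

score = unigram_f1(candidate, reference)
print(round(score, 3))
```

Overlap metrics reward shared wording rather than correctness, so a description can score well while misstating what the code does.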

Solves for

Train code-to-documentation models that automatically generate docstrings and API documentation from code

Build code explanation systems that produce human-readable descriptions of code functionality

Evaluate how well code-language models understand code and can articulate behavior in natural language

Create benchmarks for automatic documentation generation and code summarization systems

Best for

Teams building automatic documentation generation tools

ML researchers training code-to-text generation models

Organizations implementing code quality and documentation standards

Requires

HuggingFace Datasets library

Python 3.7+

Text generation models (e.g., CodeT5, BART, T5)

Limitations

Expert-generated descriptions may reflect specific documentation styles and may not generalize to all code description patterns

Text generation quality depends heavily on model architecture and training procedure, not just dataset quality

Descriptions may be incomplete or may not capture all nuances of code behavior

What makes it unique

Combines code snippets with expert-generated natural language descriptions across multiple languages, enabling training of code-to-text models through abstractive and detailed generation formulations — integrates code understanding with natural language generation at scale

vs alternatives

More multilingual and larger than CodeSearchNet's code-to-documentation pairs and includes expert-validated descriptions, whereas most prior datasets rely on mined documentation or single-language focus

Capabilities are decomposed by AI analysis. Each maps to specific user intents and improves with match feedback.

Related Artifacts (sharing capabilities)

Artifacts that share capabilities with xCodeEval, ranked by overlap. Discovered automatically through the match graph.

Dataset (46)

CodeSearchNet

6M functions across 6 languages paired with documentation.

code clone and similarity detection dataset, code understanding model training corpus, code-to-documentation paired dataset creation, multi-language code tokenization and vocabulary

4 shared capabilities

Dataset (45)

xCodeEval

Multilingual code evaluation across 17 languages.

code-to-code retrieval with semantic similarity matching, natural language to code retrieval with semantic matching

2 shared capabilities

Dataset (48)

The Stack v2

67 TB permissively licensed code dataset across 600+ languages.

multi-language source code normalization and deduplication, 600+ programming language support with language-specific metadata

2 shared capabilities

Repository (44)

CodeT5

Home of CodeT5: Open Code LLMs for Code Understanding and Generation

text-to-code retrieval with cross-lingual matching

1 shared capability

Dataset (45)

StarCoderData

250GB curated code dataset for StarCoder training.

multi-language code dataset curation with near-deduplication

1 shared capability

Model (45)

DeepSeek V3

671B MoE model matching GPT-4o at fraction of training cost.

multi-language support across 40+ programming languages and natural languages

1 shared capability

Best For

✓ ML researchers training code translation models
✓ Teams building cross-language code migration tools
✓ Developers evaluating multilingual code LLM performance
✓ Organizations standardizing legacy codebases across multiple languages
✓ Security researchers building code plagiarism detection systems
✓ ML engineers training code embedding models
✓ Teams managing large polyglot codebases needing deduplication
✓ Researchers studying code semantics and language-agnostic representations

Known Limitations

⚠ Expert annotations may reflect specific translation preferences and idioms, not all valid translations
⚠ Dataset size (1M-10M size category) may be insufficient for training very large models on all language pairs equally
⚠ Language pair coverage is uneven; some language combinations may have significantly fewer examples than others
⚠ Annotations are static snapshots and don't capture evolving language features or modern idioms introduced after dataset creation
⚠ Expert annotations reflect human judgment of equivalence, which may not align with all valid interpretations of 'semantic equivalence'
⚠ Clone detection focuses on function/method-level granularity; may not capture equivalence at statement or expression level

Requirements

HuggingFace Datasets library (datasets package)

Python 3.7+

Sufficient disk space for 1M-10M size category dataset (~2-5GB estimated)

Understanding of target programming languages for meaningful evaluation

Code parsing/tokenization tools for target languages

Familiarity with code similarity metrics and embedding approaches

Text embedding and code embedding models (e.g., CodeBERT, GraphCodeBERT)

Input / Output

Accepts: source code (Java, Python, C++, JavaScript, Go, Rust, etc.), code snippets with function/method scope, code with inline comments and documentation, code snippet pairs (same or different languages), function/method implementations, code with variable names and comments, natural language queries (English and potentially other languages), code snippets in multiple programming languages, function signatures and documentation strings, natural language questions about code behavior, function signatures and surrounding context, code snippets with token-level annotations, code in multiple programming languages, function/method implementations with context, code snippet pairs (positive and negative examples), function/method implementations with varying complexity, function/method implementations with signatures, code with variable names and inline comments

Produces: translated code in target language, structured metadata (language pair, translation quality scores), token-level alignment annotations for fine-grained analysis, binary classification labels (equivalent/not equivalent), similarity scores or confidence levels, token-level alignment for fine-grained analysis, ranked lists of relevant code snippets, relevance scores or similarity metrics, embedding vectors for code and queries, natural language answers explaining code behavior, extractive spans from code or documentation, structured explanations of code semantics, token-level semantic labels (variable, function, keyword, etc.), structured code features (scope, type, data flow), BIO/BIOES tagged sequences for sequence labeling, embedding vectors for code snippets, similarity scores between code pairs, clustering assignments based on semantic equivalence, natural language descriptions and docstrings, API documentation and usage examples, code summaries and explanations

UnfragileRank

Adoption: 15% (35% weight)

Quality: 16% (25% weight)

Ecosystem: 60% (20% weight)

Match Graph: 10% (15% weight)

Freshness: 75% (5% weight)

UnfragileRank is computed from adoption signals, documentation quality, ecosystem connectivity, match graph feedback, and freshness. No artifact can pay for a higher rank.
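The page does not state how the components are combined. As a worked example, assuming the rank is a plain weighted average of the component percentages listed above (an assumption, not a documented formula), the arithmetic would be:

```python
# Component scores and weights, both in percent, copied from the list above.
# Combining them as a weighted average is an assumption for illustration.
components = {
    "Adoption": (15, 35),
    "Quality": (16, 25),
    "Ecosystem": (60, 20),
    "Match Graph": (10, 15),
    "Freshness": (75, 5),
}

def weighted_score(parts):
    """Weighted average on a 0-100 scale; weights are expected to sum to 100."""
    total_weight = sum(weight for _, weight in parts.values())
    if total_weight != 100:
        raise ValueError("weights should sum to 100")
    return sum(score * weight for score, weight in parts.values()) / total_weight

rank_estimate = weighted_score(components)
print(rank_estimate)  # 26.5
```

Under this assumption, Adoption and Quality together carry 60% of the weight, so their low percentages dominate the result despite the higher Ecosystem and Freshness values.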

Type: Dataset

7 capabilities


About

xCodeEval: a dataset on HuggingFace with 696,087 downloads

Alternatives to xCodeEval

wink-embeddings-sg-100d (Repository, 24)

100-dimensional English word embeddings for wink-nlp

voyage-ai-provider (API, 30)

Voyage AI Provider for running Voyage AI models with Vercel AI SDK

@vibe-agent-toolkit/rag-lancedb (Agent, 27)

LanceDB implementation of RAG interfaces for vibe-agent-toolkit

vectra (Repository, 41)

A lightweight, file-backed vector database for Node.js and browsers with Pinecone-compatible filtering and hybrid BM25 search.


Data Sources

huggingface
