LiveCodeBench
Benchmark · Free. Continuously updated coding benchmark: new competitive programming problems prevent contamination.
Capabilities (13 decomposed)
temporal-contamination-detection-via-problem-release-dating
Medium confidence: Annotates each benchmark problem with its release date from source platforms (LeetCode, AtCoder, Codeforces), enabling detection of data contamination by comparing model performance across temporal cohorts. When a model's performance drops sharply at its training cutoff date, it indicates earlier problems were likely in training data. This design allows researchers to identify which models have been exposed to benchmark problems during pretraining without requiring explicit data audits.
Uses temporal annotation of problems from live competitive platforms as a built-in contamination detector rather than relying on external audits or data provenance tracking. DeepSeek models showed a 'stark drop in performance on LeetCode problems released since September 2023' (around the models' own release date), demonstrating the mechanism's effectiveness at identifying exposure to benchmark data.
More practical than static benchmarks like HumanEval because it continuously incorporates new problems released after model training cutoffs, making contamination immediately detectable through performance degradation rather than requiring retrospective data audits.
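As a minimal sketch of the cohort comparison this enables (the result schema and the September 2023 cutoff below are illustrative assumptions, not LiveCodeBench's actual data format):

```python
from datetime import date

def cohort_pass_rates(results, cutoff):
    """Compare pass rates on problems released before vs. after a model's
    training cutoff date.

    `results` is assumed to be a list of dicts with `release_date`
    (datetime.date) and boolean `passed` fields -- a hypothetical schema.
    """
    pre = [r["passed"] for r in results if r["release_date"] < cutoff]
    post = [r["passed"] for r in results if r["release_date"] >= cutoff]
    rate = lambda xs: sum(xs) / len(xs) if xs else float("nan")
    return rate(pre), rate(post)

# Toy data: a sharp drop after the cutoff suggests the earlier problems
# were likely seen during training.
results = [
    {"release_date": date(2023, 6, 10), "passed": True},
    {"release_date": date(2023, 8, 2), "passed": True},
    {"release_date": date(2023, 10, 5), "passed": False},
    {"release_date": date(2024, 1, 20), "passed": False},
]
print(cohort_pass_rates(results, cutoff=date(2023, 9, 1)))  # (1.0, 0.0)
```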
continuous-problem-ingestion-from-competitive-platforms
Medium confidence: Automatically or semi-automatically ingests new coding problems from active competitive programming platforms (LeetCode, AtCoder, Codeforces) with release date metadata, maintaining a rolling window of 300+ problems spanning May 2023 to February 2024 and beyond. Problems are curated for quality and difficulty distribution, then integrated into the benchmark evaluation pipeline with standardized input/output formats and test case extraction.
Treats competitive programming platforms as live data sources rather than static snapshots, with automated or semi-automated ingestion pipelines that preserve release date metadata. This enables the benchmark to grow continuously and stay ahead of model training cutoffs, unlike static benchmarks that become stale within months of release.
Outpaces static benchmarks like HumanEval (164 problems, released in 2021) by continuously incorporating new problems from active platforms, making it harder for models to memorize solutions and enabling contamination detection through temporal analysis.
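A rough sketch of what preserving release-date metadata during ingestion might look like (the record fields, `Problem` dataclass, and dedup-by-id logic are assumptions for illustration, not the benchmark's documented pipeline):

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class Problem:
    # Hypothetical schema; the real benchmark's fields may differ.
    platform: str        # "leetcode" | "atcoder" | "codeforces"
    problem_id: str
    statement: str
    release_date: date   # preserved so temporal contamination cohorts can be built
    tests: list          # (stdin, expected_stdout) pairs

def ingest(raw_records, existing_ids):
    """Normalize newly scraped records and keep only problems not seen before."""
    new_problems = []
    for rec in raw_records:
        if rec["id"] in existing_ids:
            continue
        new_problems.append(Problem(
            platform=rec["platform"],
            problem_id=rec["id"],
            statement=rec["statement"],
            release_date=date.fromisoformat(rec["released"]),
            tests=[(t["input"], t["output"]) for t in rec["tests"]],
        ))
    return new_problems
```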
open-source-benchmark-infrastructure-and-reproducibility
Medium confidence: Provides open-source code repository and data access for the benchmark, enabling researchers to reproduce evaluation results, extend the benchmark with new problems or scenarios, and run local evaluations without relying on a centralized service. Code repository includes evaluation scripts, problem parsing logic, and leaderboard infrastructure. Data access includes problem statements, test cases, and evaluation results, enabling offline analysis and custom evaluation pipelines.
Provides open-source infrastructure for benchmark evaluation and data access, enabling reproducibility and community contributions. Fully open infrastructure is less common than closed leaderboards and supports the benchmark's goal of maintaining integrity through transparency.
More transparent and reproducible than closed evaluation services because it provides open-source code and data, enabling independent verification and community contributions.
problem-difficulty-and-category-stratification
Medium confidence: Organizes benchmark problems by difficulty levels and categories (implied from competitive programming problem taxonomies), enabling evaluation of model performance across problem subsets. Allows analysis of whether models perform consistently across difficulty levels or show degradation on harder problems. Enables targeted evaluation of specific problem categories (e.g., dynamic programming, graph algorithms, string manipulation) to identify capability gaps.
Enables stratified analysis of model performance across difficulty levels and problem categories, revealing whether models have consistent capability or show degradation on harder problems. This level of detail is not provided by single-metric benchmarks.
More granular than aggregate leaderboards because it enables analysis of performance across problem subsets, revealing capability gaps that aggregate metrics might hide.
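A small sketch of the stratified aggregation described above (field names are hypothetical):

```python
from collections import defaultdict

def stratified_pass_rates(results):
    """Pass rate per (difficulty, category) bucket, e.g. ("hard", "dynamic programming").

    `results`: iterable of dicts with `difficulty`, `category`, and boolean
    `passed` fields -- an assumed schema for illustration.
    """
    buckets = defaultdict(lambda: [0, 0])  # (difficulty, category) -> [passed, total]
    for r in results:
        key = (r["difficulty"], r["category"])
        buckets[key][0] += r["passed"]
        buckets[key][1] += 1
    return {k: passed / total for k, (passed, total) in buckets.items()}
```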
continuous-leaderboard-updates-with-new-problem-results
Medium confidence: Automatically updates the public leaderboard as new problems are added to the benchmark and models are re-evaluated against the expanded problem set. This ensures the leaderboard reflects the current benchmark state and prevents models from achieving artificially high scores on a fixed problem set. The continuous update mechanism is enabled by the automated problem ingestion pipeline and evaluation infrastructure.
Implements continuous leaderboard updates as problems are added, preventing benchmark stagnation and gaming; most benchmarks (HumanEval, MBPP) use static problem sets with infrequent updates.
Continuous updates ensure the leaderboard reflects the current benchmark state and prevent gaming; static benchmarks become outdated and contaminated as model training data grows.
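One way such an update loop could be structured (a sketch; `evaluate` and the score cache are placeholders, not the benchmark's actual infrastructure):

```python
def refresh_leaderboard(models, problems, score_cache, evaluate):
    """Evaluate each model only on problems it has not been scored on yet,
    then recompute rankings over the full, expanded problem set.

    `score_cache` maps (model, problem_id) -> bool; `evaluate` runs one model
    on one problem and returns whether it passed.
    """
    for model in models:
        for prob in problems:
            key = (model, prob["id"])
            if key not in score_cache:
                score_cache[key] = evaluate(model, prob)
    scores = {
        m: sum(score_cache[(m, p["id"])] for p in problems) / len(problems)
        for m in models
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```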
multi-scenario-code-capability-evaluation
Medium confidence: Evaluates models across four distinct code-related scenarios: (1) free-form code generation from problem descriptions, (2) self-repair of broken code, (3) test output prediction without execution, and (4) code execution with result validation. Each scenario tests different aspects of code understanding and generation, with separate scoring and leaderboard rankings. Models are ranked differently across scenarios, revealing capability gaps (e.g., Claude-3-Opus excels at test output prediction but not code generation).
Decomposes code capability into four orthogonal scenarios rather than treating code generation as a monolithic task. This reveals that model rankings are scenario-dependent (Claude-3-Opus beats GPT-4-Turbo on test output prediction but not code generation) and that some models overfit to generation benchmarks while failing at reasoning tasks like output prediction.
More comprehensive than single-scenario benchmarks like HumanEval because it tests code understanding (output prediction), repair (self-repair), and execution validation in addition to generation, exposing capability gaps that single-metric benchmarks miss.
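A minimal sketch of scenario-wise scoring (the runner functions are stubs and the structure is an assumption; only the four scenario names come from the benchmark description):

```python
# Illustrative stubs -- each would prompt the model appropriately and
# return True/False for a single problem.
def run_generation(model, problem): ...
def run_self_repair(model, problem): ...
def run_output_prediction(model, problem): ...
def run_execution(model, problem): ...

SCENARIOS = {
    "code_generation": run_generation,
    "self_repair": run_self_repair,
    "test_output_prediction": run_output_prediction,
    "code_execution": run_execution,
}

def evaluate_model(model, problems):
    """Score a model separately under each scenario so rankings stay per-scenario."""
    return {
        name: sum(bool(fn(model, p)) for p in problems) / len(problems)
        for name, fn in SCENARIOS.items()
    }
```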
pass-at-k-scoring-with-multiple-generation-attempts
Medium confidence: Evaluates code generation by allowing models multiple attempts to produce a correct solution (pass@k metric), where k typically ranges from 1 to 10. A problem is marked as 'passed' if any of the k generated solutions produces correct output on all test cases. This metric accounts for the stochastic nature of LLM generation and rewards models that can explore solution space diversity, rather than penalizing single-attempt failures.
Applies pass@k metric from prior code generation benchmarks (HumanEval, MBPP) to LiveCodeBench's continuously-updated problem set, enabling fair comparison of models with different generation strategies while accounting for sampling variance inherent in LLM outputs.
More realistic than pass@1 metrics because it acknowledges that LLMs generate stochastically and users can sample multiple times; more fair than fixed-temperature evaluation because it doesn't penalize models with higher generation diversity.
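For reference, the standard unbiased pass@k estimator from the HumanEval paper (Chen et al., 2021), where n samples are drawn and c of them pass all tests; whether LiveCodeBench computes the metric exactly this way is an assumption:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k samples, drawn without replacement
    from n generations of which c are correct, passes all test cases."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# e.g. 10 samples generated, 3 of them correct
print(pass_at_k(10, 3, 1))  # ~0.30
print(pass_at_k(10, 3, 5))  # ~0.92
```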
code-execution-validation-with-test-case-matching
Medium confidence: Executes generated code against a suite of test cases extracted from competitive programming problems, comparing actual output to expected output with exact string matching or semantic equivalence checking. Execution occurs in a controlled environment (sandboxing details unknown) with timeout and resource limits to prevent infinite loops or resource exhaustion. Problems are marked as 'passed' only if generated code produces correct output on all test cases.
Integrates code execution as a core evaluation component rather than relying solely on static analysis or LLM-based correctness prediction. This enables objective, reproducible evaluation of code correctness without manual review, leveraging test cases from competitive programming problems that are designed to catch common errors.
More rigorous than LLM-based code review because it executes code against actual test cases rather than asking another LLM to judge correctness; more comprehensive than syntax-only validation because it catches logic errors and edge case failures.
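A minimal sketch of test-case execution with a timeout; the actual sandboxing in LiveCodeBench is undocumented, so this only illustrates the exact-match comparison described above:

```python
import subprocess

def run_one_test(solution_path, stdin_text, expected_stdout, timeout_s=10):
    """Run a candidate solution on one test case in a subprocess and compare
    its stdout to the expected output after stripping trailing whitespace."""
    try:
        proc = subprocess.run(
            ["python", solution_path],
            input=stdin_text,
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return False  # infinite loop or too slow
    if proc.returncode != 0:
        return False  # runtime error
    return proc.stdout.strip() == expected_stdout.strip()
```

A problem counts as passed only if this check succeeds for every test case.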
test-output-prediction-without-code-execution
Medium confidence: Evaluates models' ability to predict the output of code without executing it, testing code understanding and reasoning about program behavior. Models are given code and test inputs, then asked to predict the output. Predictions are compared against expected outputs with accuracy scoring. This scenario tests whether models understand code semantics deeply enough to trace execution mentally, without relying on actual runtime behavior.
Isolates code understanding and reasoning from code generation by asking models to predict outputs without executing code. This reveals that some models (Claude-3-Opus, Mistral-Large) excel at reasoning-heavy tasks while others (GPT-4-Turbo) are stronger at generation, suggesting different capability profiles.
Tests a capability that code generation benchmarks miss entirely; more aligned with code review and debugging tasks than pure generation metrics, revealing that model rankings vary significantly across scenarios.
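Scoring this scenario reduces to comparing predicted strings against stored ground truth; a small sketch (function and argument names are illustrative):

```python
def output_prediction_accuracy(predictions, expected_outputs):
    """Exact-match accuracy of model-predicted outputs against the benchmark's
    stored expected outputs; no code is executed at evaluation time."""
    assert len(predictions) == len(expected_outputs)
    correct = sum(
        p.strip() == e.strip() for p, e in zip(predictions, expected_outputs)
    )
    return correct / len(expected_outputs)
```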
self-repair-capability-evaluation
Medium confidence: Evaluates models' ability to identify and fix broken code, testing debugging and repair capabilities. Models are given code with known bugs or errors, then asked to produce corrected versions. Corrected code is validated against test cases to determine if repairs were successful. This scenario tests whether models can reason about code correctness and apply fixes, beyond just generating code from scratch.
Tests code repair as a distinct capability from generation, revealing whether models can identify and fix bugs in existing code. This is less documented than generation or output prediction, but represents a practical capability for code review and refactoring workflows.
Complements code generation benchmarks by testing repair capability, which is relevant for real-world development workflows where developers spend significant time debugging and refactoring existing code rather than writing from scratch.
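One self-repair trial might look like the following sketch (`model.complete`, `run_tests`, and the prompt format are hypothetical placeholders, not LiveCodeBench's actual interfaces):

```python
def evaluate_self_repair(model, problem, broken_code, error_message, run_tests):
    """Show the model its failing code plus the error feedback, ask for a fix,
    and re-run the problem's test suite on the repaired code."""
    prompt = (
        f"Problem:\n{problem['statement']}\n\n"
        f"Your previous solution failed:\n{broken_code}\n\n"
        f"Error / failing-test feedback:\n{error_message}\n\n"
        "Return a corrected solution."
    )
    repaired_code = model.complete(prompt)
    return run_tests(repaired_code, problem["tests"])  # True if all tests pass
```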
multi-model-leaderboard-with-scenario-rankings
Medium confidence: Maintains a public leaderboard ranking 29+ LLMs across four evaluation scenarios (code generation, self-repair, test output prediction, code execution), with separate rankings per scenario and optional aggregate rankings. Leaderboard includes both closed-access API models (GPT-4-Turbo, Claude-3-Opus, Mistral-Large) and open-access models (fine-tuned variants of 30+B parameter models). Rankings are updated as new problems are added and models are re-evaluated, enabling longitudinal tracking of capability improvements.
Provides scenario-specific rankings rather than a single aggregate score, revealing that model capabilities vary significantly across code generation, repair, and reasoning tasks. This transparency prevents false conclusions about 'best' models and encourages task-specific model selection.
More nuanced than single-metric leaderboards like HumanEval because it ranks models separately across four scenarios, revealing capability gaps and preventing overfitting to generation-only benchmarks. Continuous updates with new problems prevent leaderboard saturation and gaming.
contamination-evidence-analysis-and-reporting
Medium confidence: Analyzes and reports evidence of data contamination by comparing model performance across temporal cohorts of problems. When a model shows a 'stark drop in performance' at its training cutoff date (e.g., DeepSeek models perform well on problems from May 2023 to September 2023, then drop sharply on problems released after September 2023), this indicates the earlier problems were likely in training data. Reports include performance curves, statistical summaries, and contamination confidence assessments, enabling researchers to identify and flag contaminated models.
Provides concrete, evidence-based contamination detection by analyzing performance degradation at model training cutoffs, rather than relying on external audits or data provenance tracking. DeepSeek models' 'stark drop in performance on LeetCode problems released since September 2023' provides clear evidence of contamination that would be missed by static benchmarks.
More practical and automated than manual data audits because it uses temporal analysis to detect contamination automatically; more reliable than relying on model developers' claims about training data because it provides empirical evidence.
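A sketch of the temporal analysis behind such reports, assuming per-problem results carry release dates (the 0.2 drop threshold is an illustrative choice, not the paper's criterion):

```python
from collections import defaultdict

def monthly_pass_rates(results):
    """Pass rate per release month, for plotting a performance-over-time curve."""
    buckets = defaultdict(lambda: [0, 0])  # (year, month) -> [passed, total]
    for r in results:
        key = (r["release_date"].year, r["release_date"].month)
        buckets[key][0] += r["passed"]
        buckets[key][1] += 1
    return {k: p / t for k, (p, t) in sorted(buckets.items())}

def contamination_flag(rates, cutoff_month, min_drop=0.2):
    """Flag a model when its mean pass rate falls by more than `min_drop`
    across its training cutoff -- the 'stark drop' pattern described above.
    `cutoff_month` is a (year, month) tuple, e.g. (2023, 9)."""
    pre = [v for k, v in rates.items() if k < cutoff_month]
    post = [v for k, v in rates.items() if k >= cutoff_month]
    if not pre or not post:
        return False
    return (sum(pre) / len(pre)) - (sum(post) / len(post)) > min_drop
```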
overfitting-detection-across-benchmarks
Medium confidence: Identifies models that overfit to other benchmarks (e.g., HumanEval) by comparing their performance on LiveCodeBench against their HumanEval scores. Models are clustered into two groups: those that generalize well to LiveCodeBench-Easy (green cluster) and those that overfit to HumanEval (red cluster). Example: 'DS-Ins-1.3B model outperforms Gemini-Pro and Claude-Ins-1 on HumanEval but performs considerably worse on LCB-Easy', indicating overfitting to HumanEval's specific problem distribution.
Detects overfitting to other benchmarks by comparing cross-benchmark performance, revealing that some fine-tuned models achieve high HumanEval scores through overfitting rather than true capability improvements. This is a meta-level quality check on benchmark integrity that most benchmarks don't provide.
Unique among code generation benchmarks because it explicitly detects and reports overfitting to other benchmarks, helping practitioners avoid models that are optimized for specific benchmarks rather than general code generation capability.
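A toy version of the cross-benchmark comparison (scores are assumed to be on the same 0-1 scale; the 0.15 gap threshold is an illustrative assumption, not the paper's clustering method):

```python
def flag_overfit(models, humaneval_scores, lcb_easy_scores, gap=0.15):
    """Return models whose HumanEval score exceeds their LiveCodeBench-Easy
    score by more than `gap` -- the pattern attributed to overfitting above."""
    return [
        m for m in models
        if humaneval_scores[m] - lcb_easy_scores[m] > gap
    ]
```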
Capabilities are decomposed by AI analysis. Each maps to specific user intents and improves with match feedback.
Related Artifacts (sharing capabilities)
Artifacts that share capabilities with LiveCodeBench, ranked by overlap. Discovered automatically through the match graph.
LiveBench
Continuously updated contamination-free LLM benchmark.
CodeContests
13K competitive programming problems from AlphaCode research.
APPS (Automated Programming Progress Standard)
10K coding problems across 3 difficulty levels with test suites.
Humanity's Last Exam
Hardest exam questions from thousands of experts.
Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models (BIG-bench)
Collaborative benchmark of 200+ diverse tasks probing language model capabilities.
DS-1000
1,000 data science problems across 7 Python libraries.
Best For
- ✓ benchmark maintainers validating model integrity
- ✓ researchers comparing models across different training periods
- ✓ organizations auditing LLM training data exposure
- ✓ benchmark maintainers needing to stay ahead of model training data
- ✓ researchers studying model generalization on unseen problem distributions
- ✓ organizations running continuous model evaluation pipelines
- ✓ researchers requiring reproducible, auditable benchmarks
- ✓ organizations running private evaluations with proprietary models
Known Limitations
- ⚠ only detects contamination for problems released after model training cutoff; problems from May 2023 onwards may still be in training data for models trained after that period
- ⚠ requires accurate model training date metadata; undisclosed or approximate training dates reduce detection reliability
- ⚠ performance variance from other factors (model scale, fine-tuning, inference parameters) can obscure contamination signals
- ⚠ problem ingestion pipeline and curation criteria are not documented; unclear how problems are selected, validated, or filtered for quality
- ⚠ exact distribution across difficulty levels, problem categories, and language paradigms is unknown
- ⚠ continuous updates may introduce inconsistency if curation standards drift over time
About
Continuously updated code generation benchmark using new problems from competitive programming platforms. Prevents data contamination since problems post-date model training. Tests code generation, self-repair, and test output prediction.