HumanEval
Benchmark · Free
OpenAI's code generation benchmark — 164 Python problems with unit tests, pass@k evaluation.
Capabilities (8 decomposed)
hand-crafted programming problem dataset with canonical solutions
Medium confidence
Provides a curated collection of 164 Python programming problems designed to test code generation capabilities, each with a unique task ID, natural language prompt, function signature, canonical reference implementation, and comprehensive test cases. Problems are stored in JSONL.gz format and loaded via the read_problems() function in data.py, enabling reproducible evaluation across different code generation models.
Hand-crafted by OpenAI with deliberate problem diversity covering algorithms, data structures, and edge cases; each problem includes a canonical solution and comprehensive test suite designed to catch subtle correctness issues rather than surface-level syntax errors
More rigorous and widely-adopted than crowdsourced alternatives because problems were vetted by domain experts and test cases are designed to catch functional bugs, not just runtime errors
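A minimal loading sketch, assuming the read_problems() helper sits in a human_eval.data module as described above; the canonical_solution and test key names are assumptions beyond the task ID and prompt fields mentioned in the description.

```python
# Minimal sketch, assuming read_problems() lives in human_eval.data as
# described above; the canonical_solution / test key names are assumptions.
from human_eval.data import read_problems

problems = read_problems()            # dict keyed by task_id, e.g. "HumanEval/0"
example = problems["HumanEval/0"]
print(example["prompt"])              # natural-language prompt + function signature
print(example["canonical_solution"])  # reference implementation (assumed key)
print(example["test"])                # unit-test code for this problem (assumed key)
```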
sandboxed code execution with timeout and resource limits
Medium confidence
Executes untrusted Python code in an isolated environment via the unsafe_execute() function in execution.py, with built-in protections including configurable timeout (default 10 seconds), memory limits, and exception handling. The execution engine runs generated code against problem test cases and captures pass/fail results without exposing the host system to malicious or runaway code.
Uses signal-based timeout mechanism (SIGALRM on Unix) combined with exception wrapping to safely execute untrusted code without requiring containerization, making it lightweight for research workflows while still preventing infinite loops and resource exhaustion
Simpler and faster than container-based approaches (Docker) for research benchmarking because it avoids container startup overhead, while still providing adequate isolation for non-adversarial code generation evaluation
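A sketch of the signal-based timeout pattern described above, using only the standard library; it illustrates the mechanism rather than reproducing unsafe_execute() itself, and is Unix-only because it relies on SIGALRM.

```python
import contextlib
import signal

# Sketch of a SIGALRM-based time limit (Unix-only); illustrative, not the
# benchmark's exact unsafe_execute() implementation.
@contextlib.contextmanager
def time_limit(seconds: float):
    def handler(signum, frame):
        raise TimeoutError("execution timed out")
    old_handler = signal.signal(signal.SIGALRM, handler)
    signal.setitimer(signal.ITIMER_REAL, seconds)
    try:
        yield
    finally:
        signal.setitimer(signal.ITIMER_REAL, 0)     # cancel the pending alarm
        signal.signal(signal.SIGALRM, old_handler)  # restore the previous handler

untrusted_code = "while True:\n    pass\n"  # hypothetical runaway completion
try:
    with time_limit(2.0):
        exec(untrusted_code, {})  # throwaway namespace for the untrusted code
    result = "passed"
except TimeoutError:
    result = "timed out"
except BaseException as exc:
    result = f"failed: {exc}"
print(result)
```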
functional correctness testing via unit test execution
Medium confidence
Tests generated code against problem-specific test cases via the check_correctness() function in execution.py, which executes both the canonical solution and generated code against identical test suites to verify functional equivalence. Test cases are embedded in each problem definition and executed in the sandboxed environment, with detailed failure reporting including assertion errors and exception traces.
Executes test cases in the same sandboxed environment as generated code, ensuring identical execution context and preventing false positives from environment-dependent behavior; test cases are embedded in problem definitions rather than stored separately, ensuring tight coupling between problems and their validation logic
More reliable than static analysis or type checking because it actually executes code and validates outputs, while being simpler than property-based testing frameworks because test cases are hand-written and problem-specific
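A sketch of checking a single completion, assuming check_correctness() is importable from a human_eval.execution module and takes a problem record, a completion string, and a timeout; the module path, argument names, result keys, and the completion string itself are assumptions.

```python
from human_eval.data import read_problems
from human_eval.execution import check_correctness

# Sketch: run one generated completion against its problem's test suite.
# Module path, argument names, and result keys are assumptions based on the
# description above; the completion is a hypothetical model output.
problems = read_problems()
problem = problems["HumanEval/0"]
completion = "    return 42  # hypothetical (and likely wrong) model output\n"

result = check_correctness(problem, completion, timeout=10.0)
print(result["passed"])  # True only if every test assertion succeeded
print(result["result"])  # e.g. "passed", an assertion error, or a timeout message
```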
pass@k metric calculation with unbiased statistical estimation
Medium confidence
Calculates the pass@k metric via estimate_pass_at_k() in evaluation.py, which estimates the probability that at least one of k code samples passes all test cases for a given problem. Uses an unbiased estimator that accounts for sampling without replacement, enabling fair comparison of code generation models that produce different numbers of samples per problem.
Implements unbiased pass@k estimator that corrects for sampling without replacement, preventing overestimation of model performance when fewer than k samples are available; formula accounts for the hypergeometric distribution rather than assuming independence
More statistically rigorous than naive pass@k calculation (which assumes independence) because it uses the unbiased estimator formula, enabling fair comparison of models with different sample budgets
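The estimator itself is compact. As a standalone restatement: with n samples per problem and c of them passing all tests, pass@k = 1 - C(n-c, k) / C(n, k), computed in the numerically stable product form below.

```python
import numpy as np

# Unbiased pass@k estimator in its numerically stable product form:
#   pass@k = 1 - C(n - c, k) / C(n, k)
# n = samples drawn for a problem, c = samples that passed all tests.
def pass_at_k(n: int, c: int, k: int) -> float:
    if n - c < k:
        return 1.0  # any draw of k samples must contain a correct one
    return 1.0 - float(np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

# Example: 200 samples per problem, 34 passed -> estimate of pass@10
print(pass_at_k(200, 34, 10))
```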
jsonl-based completion input/output pipeline
Medium confidence
Provides stream_jsonl() and write_jsonl() functions in data.py for reading code completions from JSONL files and writing evaluation results back to JSONL format. Each completion record contains task_id, completion string, and optional metadata; results include pass/fail status, detailed error messages, and execution metrics. This format enables efficient processing of large batches of completions without loading entire datasets into memory.
Uses streaming JSONL parsing to avoid loading entire completion datasets into memory, enabling evaluation of millions of samples on resource-constrained systems; results are written incrementally as evaluations complete rather than buffered
More memory-efficient than CSV or JSON alternatives because streaming parser processes one record at a time, while still maintaining structured format compatibility with standard data tools
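A streaming round-trip sketch, assuming stream_jsonl() and write_jsonl() are importable from human_eval.data as described above; the file names are placeholders and the per-record evaluation step is stubbed out.

```python
from human_eval.data import stream_jsonl, write_jsonl

# Sketch of streaming JSONL I/O; file names are placeholders and the
# per-record evaluation is stubbed out rather than actually run.
def evaluated(records):
    for record in records:        # one dict at a time, never the whole file
        yield {
            "task_id": record["task_id"],
            "completion": record["completion"],
            "passed": False,      # would be filled in by sandboxed execution
        }

# Passing a generator lets results stream to disk as they are produced
# instead of being buffered in memory first.
write_jsonl("samples_results.jsonl", evaluated(stream_jsonl("samples.jsonl")))
```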
command-line evaluation orchestration
Medium confidence
Provides a CLI tool (evaluate_functional_correctness) that orchestrates the entire evaluation pipeline: reads completions from JSONL, executes code in sandbox, runs test cases, calculates pass@k metrics, and writes results to output file. Supports configurable k values via --k parameter and parallelizes evaluation across multiple problems using Python's multiprocessing module.
Single-command evaluation pipeline that chains data loading, code execution, testing, and metric calculation without requiring intermediate file handling; uses Python multiprocessing to parallelize problem evaluation across CPU cores automatically
Simpler than writing custom evaluation scripts because it handles all pipeline stages in one command, while being more flexible than web-based benchmarking platforms because it runs locally without network dependencies
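A sketch of driving the same pipeline from Python; the shell invocation shown in the comment and the keyword arguments of the library call are assumptions based on the description above.

```python
# Sketch of the single-command pipeline. From the shell the installed entry
# point described above would typically be invoked as:
#   evaluate_functional_correctness samples.jsonl
# The library call below assumes the function lives in human_eval.evaluation
# and accepts these keyword arguments.
from human_eval.evaluation import evaluate_functional_correctness

scores = evaluate_functional_correctness(
    sample_file="samples.jsonl",  # one JSON object per completion
    k=[1, 10, 100],               # pass@k values to report
    n_workers=4,                  # problems evaluated in parallel processes
)
print(scores)  # e.g. {"pass@1": ..., "pass@10": ..., "pass@100": ...}
```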
problem-specific test case isolation and execution
Medium confidence
Executes test cases in isolated Python scopes via the check_correctness() function, which creates a fresh namespace for each code sample and test execution to prevent state leakage between problems. Test code is executed after the generated function is defined, with explicit assertion statements that raise exceptions on failure, enabling precise error reporting without requiring external test frameworks.
Uses Python's exec() with isolated namespace dictionaries to ensure each problem's test execution does not affect others, combined with exception wrapping to capture and report assertion failures with full stack traces
More reliable than pytest or unittest frameworks for this use case because it avoids framework overhead and provides direct control over execution context, while still capturing detailed failure information
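A sketch of the per-sample isolation pattern: each sample gets a fresh namespace dictionary, the candidate function is defined in it, and the assertions run against that same dictionary; the prompt, completion, and test strings here are hypothetical stand-ins.

```python
# Sketch of per-sample namespace isolation with exec(); the prompt,
# completion, and test strings are hypothetical stand-ins for one problem.
prompt = "def add(a, b):\n"
completion = "    return a + b\n"
test_code = "assert add(2, 3) == 5\nassert add(-1, 1) == 0\n"

namespace = {}  # fresh dict per sample, so no state leaks between problems
try:
    exec(prompt + completion, namespace)  # define the candidate function
    exec(test_code, namespace)            # run the assertions against it
    outcome = "passed"
except AssertionError as exc:
    outcome = f"failed assertion: {exc}"
except BaseException as exc:
    outcome = f"crashed: {type(exc).__name__}: {exc}"
print(outcome)
```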
multi-sample code generation evaluation with statistical aggregation
Medium confidence
Supports evaluating multiple code samples per problem via the evaluate_functional_correctness() function, which processes JSONL files containing multiple completions per task_id and aggregates results to calculate per-problem pass@k statistics. Handles variable numbers of samples per problem and produces both per-sample and aggregated metrics in output JSONL.
Processes variable-length sample lists per problem and calculates pass@k for each k value in a single pass, using the unbiased estimator to handle problems with fewer samples than k
More efficient than running separate evaluations for each k value because it calculates all k values from a single set of pass/fail results, while supporting arbitrary numbers of samples per problem
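A sketch of the aggregation step: group pass/fail flags by task_id, then compute every requested k from the same flags, reusing the estimator form from the earlier sketch; the result records are illustrative.

```python
from collections import defaultdict
import numpy as np

def pass_at_k(n, c, k):
    # unbiased estimator, same product form as in the earlier sketch
    if n - c < k:
        return 1.0
    return 1.0 - float(np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

# Illustrative per-sample results; real records come from the results JSONL.
results = [
    {"task_id": "HumanEval/0", "passed": True},
    {"task_id": "HumanEval/0", "passed": False},
    {"task_id": "HumanEval/1", "passed": False},
    {"task_id": "HumanEval/1", "passed": False},
]

by_task = defaultdict(list)
for r in results:
    by_task[r["task_id"]].append(r["passed"])

for k in (1, 2):
    estimates = [
        pass_at_k(n=len(flags), c=sum(flags), k=k)
        for flags in by_task.values()
        if len(flags) >= k  # skip tasks with fewer samples than k
    ]
    print(f"pass@{k} = {np.mean(estimates):.3f}")
```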
Capabilities are decomposed by AI analysis. Each maps to specific user intents and improves with match feedback.
Related Artifacts (sharing capabilities)
Artifacts that share capabilities with HumanEval, ranked by overlap. Discovered automatically through the match graph.
phantom-lens
A Cluely / Interview Coder alternative with features we probably shouldn’t talk about, built for winning exams.
CodeContests
13K competitive programming problems from AlphaCode research.
LiveCodeBench
Continuously updated coding benchmark — new competitive programming problems, prevents contamination.
Code Coach
Master FAANG interviews with AI-driven, instant-feedback...
Aider Polyglot
Multi-language AI coding benchmark — tests code editing ability across 10+ languages.
DS-1000
1,000 data science problems across 7 Python libraries.
Best For
- ✓ ML researchers evaluating code generation models
- ✓ LLM developers measuring functional correctness improvements
- ✓ teams building code synthesis tools who need reproducible baselines
- ✓ researchers evaluating untrusted code generation models
- ✓ automated CI/CD pipelines that need to test LLM-generated code
- ✓ teams building code synthesis tools with safety requirements
- ✓ evaluating code generation models on functional correctness metrics
- ✓ debugging why generated code fails specific test cases
Known Limitations
- ⚠ Limited to 164 problems — may not capture domain-specific code generation tasks
- ⚠ Python-only dataset — cannot evaluate code generation for other languages
- ⚠ Problems are relatively short (function-level) — does not test multi-file or large-scale code generation
- ⚠ Hand-crafted nature means potential bias toward certain problem types or difficulty distributions
- ⚠ Timeout mechanism is process-level only — does not prevent all forms of resource exhaustion (e.g., memory bombs)
- ⚠ Sandboxing is not a hardened security boundary — suitable for research but not production multi-tenant systems
About
OpenAI's benchmark for evaluating code generation. 164 hand-crafted Python programming problems with unit tests. Measures functional correctness (pass@k). The original and most cited code generation benchmark.
Categories
Alternatives to HumanEval