{"passport":{"unfragile":{"@version":"1.0","version":"2026-05","artifact":{"id":"humaneval","slug":"humaneval","name":"HumanEval","type":"benchmark","url":"https://github.com/openai/human-eval","page_url":"https://unfragile.ai/humaneval","categories":["testing-quality"],"tags":[],"pricing":{"model":"free","free":true,"starting_price":null},"status":"active","verified":false},"capabilities":[{"id":"humaneval__cap_0","uri":"capability://data.processing.analysis.hand.crafted.programming.problem.dataset.with.canonical.solutions","name":"hand-crafted programming problem dataset with canonical solutions","description":"Provides a curated collection of 164 Python programming problems designed to test code generation capabilities, each with a unique task ID, natural language prompt, function signature, canonical reference implementation, and comprehensive test cases. Problems are stored in JSONL.gz format and loaded via the read_problems() function in data.py, enabling reproducible evaluation across different code generation models.","intents":["benchmark my code generation model against a standardized dataset","understand what types of programming tasks my LLM struggles with","compare performance across multiple code generation approaches using identical test cases"],"best_for":["ML researchers evaluating code generation models","LLM developers measuring functional correctness improvements","teams building code synthesis tools who need reproducible baselines"],"limitations":["Limited to 164 problems — may not capture domain-specific code generation tasks","Python-only dataset — cannot evaluate code generation for other languages","Problems are relatively short (function-level) — does not test multi-file or large-scale code generation","Hand-crafted nature means potential bias toward certain problem types or difficulty distributions"],"requires":["Python 3.6+","HumanEval package installed via pip","Access to HumanEval.jsonl.gz dataset file"],"input_types":["JSONL.gz file containing problem definitions"],"output_types":["structured problem objects with task_id, prompt, entry_point, canonical_solution, test fields"],"categories":["data-processing-analysis","benchmark-dataset"],"confidence":0.5,"matches":0,"success_rate":0},{"id":"humaneval__cap_1","uri":"capability://code.generation.editing.sandboxed.code.execution.with.timeout.and.resource.limits","name":"sandboxed code execution with timeout and resource limits","description":"Executes untrusted Python code in an isolated environment via the unsafe_execute() function in execution.py, with built-in protections including configurable timeout (default 10 seconds), memory limits, and exception handling. The execution engine runs generated code against problem test cases and captures pass/fail results without exposing the host system to malicious or runaway code.","intents":["safely run code generated by LLMs without risking system compromise","detect infinite loops or resource exhaustion in generated code","test code completions against multiple test cases and capture detailed failure information"],"best_for":["researchers evaluating untrusted code generation models","automated CI/CD pipelines that need to test LLM-generated code","teams building code synthesis tools with safety requirements"],"limitations":["Timeout mechanism is process-level only — does not prevent all forms of resource exhaustion (e.g., memory bombs)","Sandboxing is not cryptographically isolated — suitable for research but not production multi-tenant systems","No network isolation — generated code can make outbound requests","Python-only execution — cannot test code generation for other languages"],"requires":["Python 3.6+","Unix-like OS (Linux, macOS) — Windows support is limited","Ability to spawn child processes with signal handling"],"input_types":["Python code string","test case string","entry point function name"],"output_types":["boolean pass/fail result","exception details if code fails","execution time in seconds"],"categories":["code-generation-editing","safety-moderation"],"confidence":0.5,"matches":0,"success_rate":0},{"id":"humaneval__cap_2","uri":"capability://code.generation.editing.functional.correctness.testing.via.unit.test.execution","name":"functional correctness testing via unit test execution","description":"Tests generated code against problem-specific test cases via the check_correctness() function in execution.py, which executes both the canonical solution and generated code against identical test suites to verify functional equivalence. Test cases are embedded in each problem definition and executed in the sandboxed environment, with detailed failure reporting including assertion errors and exception traces.","intents":["verify that generated code produces correct outputs for all test cases","identify specific test cases where generated code fails","compare correctness of different code generation approaches on identical problems"],"best_for":["evaluating code generation models on functional correctness metrics","debugging why generated code fails specific test cases","building automated test suites for code synthesis systems"],"limitations":["Test coverage depends on problem author's test case design — may miss edge cases not covered by tests","Only tests function-level correctness — does not evaluate code quality, efficiency, or readability","Assumes test cases are deterministic — non-deterministic code may produce inconsistent results","No support for parameterized or property-based testing"],"requires":["Python 3.6+","Problem definition with test field containing valid Python test code","Generated code that matches the function signature specified in entry_point"],"input_types":["generated code string","problem definition with test cases","entry point function name"],"output_types":["boolean pass/fail result","detailed error message if test fails","execution time"],"categories":["code-generation-editing","testing-quality"],"confidence":0.5,"matches":0,"success_rate":0},{"id":"humaneval__cap_3","uri":"capability://data.processing.analysis.pass.k.metric.calculation.with.unbiased.statistical.estimation","name":"pass@k metric calculation with unbiased statistical estimation","description":"Calculates the pass@k metric via estimate_pass_at_k() in evaluation.py, which estimates the probability that at least one of k code samples passes all test cases for a given problem. Uses an unbiased estimator that accounts for sampling without replacement, enabling fair comparison of code generation models that produce different numbers of samples per problem.","intents":["measure code generation model performance using the standard pass@k metric","compare models that generate different numbers of samples per problem fairly","understand the relationship between sample count and correctness probability"],"best_for":["researchers publishing code generation benchmarks","teams comparing multiple code generation models","building leaderboards for code synthesis tasks"],"limitations":["Pass@k only measures correctness, not code quality or efficiency","Assumes samples are independent — does not account for correlated failures across samples","Requires at least k passing samples to calculate meaningful statistics — unreliable for very low pass rates","Does not provide confidence intervals or statistical significance testing"],"requires":["Python 3.6+","List of pass/fail results for multiple samples per problem","k value (number of samples to consider)"],"input_types":["list of boolean pass/fail results","k value as integer"],"output_types":["float between 0.0 and 1.0 representing estimated pass@k probability","per-problem pass@k scores"],"categories":["data-processing-analysis","testing-quality"],"confidence":0.5,"matches":0,"success_rate":0},{"id":"humaneval__cap_4","uri":"capability://data.processing.analysis.jsonl.based.completion.input.output.pipeline","name":"jsonl-based completion input/output pipeline","description":"Provides stream_jsonl() and write_jsonl() functions in data.py for reading code completions from JSONL files and writing evaluation results back to JSONL format. Each completion record contains task_id, completion string, and optional metadata; results include pass/fail status, detailed error messages, and execution metrics. This format enables efficient processing of large batches of completions without loading entire datasets into memory.","intents":["load code completions generated by external models into HumanEval for evaluation","save evaluation results in a structured format for downstream analysis","process large batches of completions efficiently without memory overhead"],"best_for":["teams integrating HumanEval into ML pipelines","researchers evaluating multiple code generation models","building automated evaluation workflows"],"limitations":["JSONL format requires one record per line — does not support pretty-printed JSON","No built-in schema validation — malformed records are skipped silently","Streaming approach means results are not aggregated in memory — requires post-processing for statistics","No support for compressed output — results files can be large"],"requires":["Python 3.6+","Input JSONL file with task_id and completion fields","Write permissions to output directory"],"input_types":["JSONL file with completion records","each record: {task_id: string, completion: string}"],"output_types":["JSONL file with evaluation results","each record: {task_id: string, passed: boolean, result: string, ...}"],"categories":["data-processing-analysis","automation-workflow"],"confidence":0.5,"matches":0,"success_rate":0},{"id":"humaneval__cap_5","uri":"capability://automation.workflow.command.line.evaluation.orchestration","name":"command-line evaluation orchestration","description":"Provides a CLI tool (evaluate_functional_correctness) that orchestrates the entire evaluation pipeline: reads completions from JSONL, executes code in sandbox, runs test cases, calculates pass@k metrics, and writes results to output file. Supports configurable k values via --k parameter and parallelizes evaluation across multiple problems using Python's multiprocessing module.","intents":["run end-to-end evaluation of code completions without writing custom Python code","evaluate multiple k values in a single run","integrate HumanEval into shell scripts and CI/CD pipelines"],"best_for":["researchers running quick benchmarks from command line","CI/CD pipelines that need to evaluate code generation models","teams without Python expertise who want to use HumanEval"],"limitations":["Limited configuration options compared to programmatic API — cannot customize timeout or resource limits","Parallelization is fixed to number of CPU cores — no way to limit concurrency","Error handling is minimal — failures in one problem may crash entire evaluation","No progress reporting or logging — difficult to debug issues in large evaluation runs"],"requires":["Python 3.6+","HumanEval package installed with CLI entry point","Input JSONL file with completions","Unix-like shell environment"],"input_types":["JSONL file path (positional argument)","k values as comma-separated integers (--k parameter)"],"output_types":["JSONL results file (input_file_results.jsonl)","stdout with summary statistics"],"categories":["automation-workflow","tool-use-integration"],"confidence":0.5,"matches":0,"success_rate":0},{"id":"humaneval__cap_6","uri":"capability://code.generation.editing.problem.specific.test.case.isolation.and.execution","name":"problem-specific test case isolation and execution","description":"Executes test cases in isolated Python scopes via check_correctness() function, which creates a fresh namespace for each code sample and test execution to prevent state leakage between problems. Test code is executed after the generated function is defined, with explicit assertion statements that raise exceptions on failure, enabling precise error reporting without requiring external test frameworks.","intents":["ensure test cases for different problems do not interfere with each other","capture detailed error messages when test assertions fail","validate that generated code works correctly in isolation"],"best_for":["evaluating code generation models where test isolation is critical","debugging specific test failures with detailed error traces","ensuring reproducibility across evaluation runs"],"limitations":["Test isolation is namespace-based only — does not prevent global state mutations (e.g., file system changes)","Requires test cases to use explicit assertions — does not support implicit test frameworks","No support for test fixtures or setup/teardown logic","Test code must be valid Python — syntax errors in test cases crash evaluation"],"requires":["Python 3.6+","Problem definition with valid Python test code","Generated code that defines the required function"],"input_types":["generated code string","test code string","entry point function name"],"output_types":["boolean pass/fail result","AssertionError or exception details if test fails"],"categories":["code-generation-editing","testing-quality"],"confidence":0.5,"matches":0,"success_rate":0},{"id":"humaneval__cap_7","uri":"capability://data.processing.analysis.multi.sample.code.generation.evaluation.with.statistical.aggregation","name":"multi-sample code generation evaluation with statistical aggregation","description":"Supports evaluating multiple code samples per problem via the evaluate_functional_correctness() function, which processes JSONL files containing multiple completions per task_id and aggregates results to calculate per-problem pass@k statistics. Handles variable numbers of samples per problem and produces both per-sample and aggregated metrics in output JSONL.","intents":["evaluate code generation models that produce multiple samples per problem","understand how sample diversity affects correctness probability","aggregate results across multiple samples for statistical analysis"],"best_for":["evaluating models like Codex or GPT-4 that generate multiple samples","analyzing the relationship between sample count and pass@k","comparing models with different sampling strategies"],"limitations":["Assumes samples are independent — does not detect or handle correlated failures","Requires all samples for a problem to be evaluated before aggregation — no streaming aggregation","No support for weighted samples or importance sampling","Memory usage scales with number of samples per problem"],"requires":["Python 3.6+","JSONL file with multiple records per task_id","Each record must have task_id and completion fields"],"input_types":["JSONL file with multiple completions per task_id"],"output_types":["JSONL results file with per-sample and aggregated pass@k metrics"],"categories":["data-processing-analysis","testing-quality"],"confidence":0.5,"matches":0,"success_rate":0},{"id":"humaneval__headline","uri":"capability://testing.quality.code.generation.evaluation.benchmark","name":"code generation evaluation benchmark","description":"HumanEval is an industry-standard benchmark for assessing the functional correctness of code generated by AI models, featuring 164 curated Python programming problems with unit tests to ensure reliable evaluation.","intents":["best code generation benchmark","code evaluation tool for AI models","how to test AI-generated code","top benchmarks for code correctness","evaluate code generation performance"],"best_for":["AI model developers","researchers in code generation"],"limitations":["limited to Python problems"],"requires":["Python environment"],"input_types":["AI-generated code"],"output_types":["evaluation metrics"],"categories":["testing-quality"],"confidence":0.5,"matches":0,"success_rate":0}],"trust":{"score":61,"verified":false,"data_access_risk":"high","permissions":["Python 3.6+","HumanEval package installed via pip","Access to HumanEval.jsonl.gz dataset file","Unix-like OS (Linux, macOS) — Windows support is limited","Ability to spawn child processes with signal handling","Problem definition with test field containing valid Python test code","Generated code that matches the function signature specified in entry_point","List of pass/fail results for multiple samples per problem","k value (number of samples to consider)","Input JSONL file with task_id and completion fields"],"failure_modes":["Limited to 164 problems — may not capture domain-specific code generation tasks","Python-only dataset — cannot evaluate code generation for other languages","Problems are relatively short (function-level) — does not test multi-file or large-scale code generation","Hand-crafted nature means potential bias toward certain problem types or difficulty distributions","Timeout mechanism is process-level only — does not prevent all forms of resource exhaustion (e.g., memory bombs)","Sandboxing is not cryptographically isolated — suitable for research but not production multi-tenant systems","No network isolation — generated code can make outbound requests","Python-only execution — cannot test code generation for other languages","Test coverage depends on problem author's test case design — may miss edge cases not covered by tests","Only tests function-level correctness — does not evaluate code quality, efficiency, or readability","builder identity is not verified yet","no observed match outcomes yet"],"rank_breakdown":{"adoption":0.7,"quality":0.8500000000000001,"ecosystem":0.39999999999999997,"match_graph":0.25,"freshness":0.52,"weights":{"adoption":0.25,"quality":0.35,"ecosystem":0.15,"match_graph":0.2,"freshness":0.05}},"observed_outcomes":{"matches":0,"success_rate":0,"avg_confidence":0,"top_intents":[],"last_matched_at":null},"maintenance":{"status":"active","updated_at":"2026-06-17T09:51:04.692Z","last_scraped_at":null,"last_commit":null},"community":{"stars":null,"forks":null,"weekly_downloads":null,"model_downloads":null,"model_likes":null}},"distribution":{"claim_url":"https://unfragile.ai/submit?claim=humaneval","compare_url":"https://unfragile.ai/compare?artifact=humaneval"}},"signature":"ZffFoBeM3TzBTghikn01OSUajv8qZwKTvXYpef5kF23Sma5WEtoPNE3sBQGmnc2j0As3Iewc28G8YlKdVHCYDw==","signedAt":"2026-06-21T07:34:41.272Z","signedBy":"unfragile.ai","version":1},"_links":{"self":"https://unfragile.ai/api/v1/passport/humaneval","artifact":"https://unfragile.ai/humaneval","verify":"https://unfragile.ai/api/v1/verify?slug=humaneval","publicKey":"https://unfragile.ai/api/v1/trust-passport-public-key","spec":"https://unfragile.ai/trust","schema":"https://unfragile.ai/schema.json","docs":"https://unfragile.ai/docs"}}