EvalPlus
Benchmark · Free
Extended code evaluation with harder test cases for HumanEval
Capabilities (1 decomposed)
extended test case generation for code evaluation
Medium confidence
EvalPlus enhances the HumanEval benchmark by adding more challenging test cases to each of the original 164 problems, extending coverage to over 40 test cases per problem. It achieves this by systematically generating diverse edge cases and complex input scenarios that force models to demonstrate genuine coding proficiency rather than overfitting to the original tests. Evaluating against this broader range of inputs and conditions is crucial for assessing real-world applicability.
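As a concrete sketch of how such an evaluation might be run, the snippet below assumes the `evalplus` Python package and its documented `get_human_eval_plus` and `write_jsonl` helpers (names follow the project's README; exact signatures may vary by version), with `generate_one_completion` standing in for your model:

```python
# Sketch: generate samples for HumanEval+ and write them out for scoring.
# Assumes the `evalplus` package; helper names follow the project's README
# but may differ across versions.
from evalplus.data import get_human_eval_plus, write_jsonl

def generate_one_completion(prompt: str) -> str:
    """Placeholder for your model's code-generation call."""
    raise NotImplementedError

problems = get_human_eval_plus()  # dict: task_id -> problem spec (incl. "prompt")
samples = [
    dict(task_id=task_id, solution=generate_one_completion(problem["prompt"]))
    for task_id, problem in problems.items()
]
write_jsonl("samples.jsonl", samples)
# Then score against the extended tests, e.g.:
#   evalplus.evaluate --dataset humaneval --samples samples.jsonl
```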
What sets EvalPlus apart is its systematic generation of a wide array of challenging test cases beyond the original HumanEval suite, enabling a more rigorous assessment of model capabilities.
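To illustrate the flavor of this systematic generation, here is a minimal, hypothetical sketch of type-aware input mutation, the general technique behind EvalPlus-style test generation. The `mutate` and `generate_inputs` functions are illustrative assumptions, not EvalPlus's actual code:

```python
import random

# Illustrative sketch: mutate a seed input according to its runtime type
# to produce edge-case variants. This shows the general idea behind
# EvalPlus-style test generation, not the project's implementation.
def mutate(value):
    if isinstance(value, bool):  # check bool before int (bool subclasses int)
        return not value
    if isinstance(value, int):
        return random.choice([0, -value, value + 1, value - 1, value * 1000])
    if isinstance(value, float):
        return random.choice([0.0, -value, value * 1e6, float("inf")])
    if isinstance(value, str):
        return random.choice(["", value[::-1], value * 2, value.upper()])
    if isinstance(value, list):
        if not value:
            return [0]
        return random.choice([[], value + value, value[:-1],
                              [mutate(value[0])] + value[1:]])
    return value

# Grow a pool of candidate inputs from a few seeds, keeping only those
# that a trusted reference solution accepts without raising.
def generate_inputs(seeds, reference_fn, rounds=100):
    pool = list(seeds)
    for _ in range(rounds):
        candidate = mutate(random.choice(pool))
        try:
            reference_fn(candidate)  # ground-truth oracle filters invalid inputs
            pool.append(candidate)
        except Exception:
            pass
    return pool
```

Filtering mutants through a reference solution is what keeps the harder test sets sound: only inputs the ground truth handles become part of the benchmark.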
More comprehensive than standard benchmarks like HumanEval, as it includes a significantly larger and more challenging set of test cases.
Capabilities are decomposed by AI analysis. Each maps to specific user intents and improves with match feedback.
Related Artifacts (sharing capabilities)
Artifacts that share capabilities with EvalPlus, ranked by overlap. Discovered automatically through the match graph.
Mutable AI
AI-Accelerated Software Development
Lingma - Alibaba Cloud AI Coding Assistant
Type Less, Code More
Codegen
Solve tickets, write tests, level up your workflow
Codex
Streamlines coding with AI-driven generation, debugging, and...
OpenAI: GPT-5.1-Codex-Mini
GPT-5.1-Codex-Mini is a smaller and faster version of GPT-5.1-Codex
Best For
- ✓ Researchers validating AI code generation models
- ✓ Developers looking for robust evaluation metrics
Known Limitations
- ⚠ Test cases may not cover all possible edge cases, leading to potential gaps in evaluation
- ⚠ Requires significant computational resources for extensive testing
UnfragileRank
UnfragileRank is computed from adoption signals, documentation quality, ecosystem connectivity, match graph feedback, and freshness. No artifact can pay for a higher rank.
About
EvalPlus extends HumanEval with additional test cases (40+ per problem). It covers the same 164 problems but with much harder test sets, catching models that overfit to the original HumanEval test cases and making it better suited for rigorous code evaluation.
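Results from this kind of evaluation are typically summarized with the unbiased pass@k estimator from the Codex paper (Chen et al., 2021), which EvalPlus-style harnesses also report. A minimal sketch, assuming n samples per problem of which c pass all tests:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021).

    n: total samples generated per problem
    c: number of samples that pass all test cases
    k: the k in pass@k
    """
    if n - c < k:
        return 1.0  # every size-k subset contains at least one passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 200 samples per problem, 30 pass the extended tests
print(round(pass_at_k(n=200, c=30, k=10), 4))
```

Because the extended test sets are stricter, c (and hence pass@k) is usually lower on EvalPlus than on the original HumanEval for the same model.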