EvalPlus
Benchmark · Free
Extended code evaluation with harder test cases for HumanEval
Capabilities (1 decomposed)
extended test case generation for code evaluation
Medium confidence
EvalPlus enhances the HumanEval benchmark by adding more challenging test cases to each of the original 164 problems, extending coverage to over 40 test cases per problem. It achieves this by systematically generating diverse edge cases and complex input scenarios that force models to demonstrate genuine coding proficiency rather than overfitting to the original tests. Evaluating against this broader range of inputs and conditions is crucial for assessing real-world applicability.
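As a concrete sketch of how such an evaluation might be run, the snippet below assumes the `evalplus` Python package and its documented `get_human_eval_plus` and `write_jsonl` helpers (names follow the project's README; exact signatures may vary by version), with `generate_one_completion` standing in for your model:

```python
# Sketch: generate samples for HumanEval+ and write them out for scoring.
# Assumes the `evalplus` package; helper names follow the project's README
# but may differ across versions.
from evalplus.data import get_human_eval_plus, write_jsonl

def generate_one_completion(prompt: str) -> str:
    """Placeholder for your model's code-generation call."""
    raise NotImplementedError

problems = get_human_eval_plus()  # dict: task_id -> problem spec (incl. "prompt")
samples = [
    dict(task_id=task_id, solution=generate_one_completion(problem["prompt"]))
    for task_id, problem in problems.items()
]
write_jsonl("samples.jsonl", samples)
# Then score against the extended tests, e.g.:
#   evalplus.evaluate --dataset humaneval --samples samples.jsonl
```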
What sets EvalPlus apart is its systematic generation of a wide array of challenging test cases beyond the original HumanEval suite, enabling a more rigorous assessment of model capabilities.
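To illustrate the flavor of this systematic generation, here is a minimal, hypothetical sketch of type-aware input mutation, the general technique behind EvalPlus-style test generation. The `mutate` and `generate_inputs` functions are illustrative assumptions, not EvalPlus's actual code:

```python
import random

# Illustrative sketch: mutate a seed input according to its runtime type
# to produce edge-case variants. This shows the general idea behind
# EvalPlus-style test generation, not the project's implementation.
def mutate(value):
    if isinstance(value, bool):  # check bool before int (bool subclasses int)
        return not value
    if isinstance(value, int):
        return random.choice([0, -value, value + 1, value - 1, value * 1000])
    if isinstance(value, float):
        return random.choice([0.0, -value, value * 1e6, float("inf")])
    if isinstance(value, str):
        return random.choice(["", value[::-1], value * 2, value.upper()])
    if isinstance(value, list):
        if not value:
            return [0]
        return random.choice([[], value + value, value[:-1],
                              [mutate(value[0])] + value[1:]])
    return value

# Grow a pool of candidate inputs from a few seeds, keeping only those
# that a trusted reference solution accepts without raising.
def generate_inputs(seeds, reference_fn, rounds=100):
    pool = list(seeds)
    for _ in range(rounds):
        candidate = mutate(random.choice(pool))
        try:
            reference_fn(candidate)  # ground-truth oracle filters invalid inputs
            pool.append(candidate)
        except Exception:
            pass
    return pool
```

Filtering mutants through a reference solution is what keeps the harder test sets sound: only inputs the ground truth handles become part of the benchmark.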
More comprehensive than standard benchmarks like HumanEval, as it includes a significantly larger and more challenging set of test cases.
Capabilities are decomposed by AI analysis. Each maps to specific user intents and improves with match feedback.
Related Artifacts (sharing capabilities)
Artifacts that share capabilities with EvalPlus, ranked by overlap. Discovered automatically through the match graph.
Mutable AI
AI-Accelerated Software Development
Lingma - Alibaba Cloud AI Coding Assistant
Type Less, Code More
Codegen
Solve tickets, write tests, level up your workflow
Codex
Streamlines coding with AI-driven generation, debugging, and...
OpenAI: GPT-5.1-Codex-Mini
GPT-5.1-Codex-Mini is a smaller and faster version of GPT-5.1-Codex
Best For
- ✓ Researchers validating AI code generation models
- ✓ Developers looking for robust evaluation metrics
Known Limitations
- ⚠ Test cases may not cover all possible edge cases, leading to potential gaps in evaluation
- ⚠ Requires significant computational resources for extensive testing
UnfragileRank
UnfragileRank is computed from adoption signals, documentation quality, ecosystem connectivity, match graph feedback, and freshness. No artifact can pay for a higher rank.
About
EvalPlus extends HumanEval with additional test cases (40+ per problem). It covers the same 164 problems but with much harder test sets, catching models that overfit to the original HumanEval test cases and making it better suited for rigorous code evaluation.
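Results from this kind of evaluation are typically summarized with the unbiased pass@k estimator from the Codex paper (Chen et al., 2021), which EvalPlus-style harnesses also report. A minimal sketch, assuming n samples per problem of which c pass all tests:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021).

    n: total samples generated per problem
    c: number of samples that pass all test cases
    k: the k in pass@k
    """
    if n - c < k:
        return 1.0  # every size-k subset contains at least one passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 200 samples per problem, 30 pass the extended tests
print(round(pass_at_k(n=200, c=30, k=10), 4))
```

Because the extended test sets are stricter, c (and hence pass@k) is usually lower on EvalPlus than on the original HumanEval for the same model.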