via “real-world GitHub issue-to-patch evaluation”
AI coding agent benchmark: real GitHub issues, end-to-end evaluation, the de facto standard for evaluating code agents.
Unique: Uses real, unmodified GitHub issues from production repositories rather than synthetic or simplified tasks, capturing authentic complexity that synthetic benchmarks miss: ambiguous requirements, legacy code patterns, and multi-file dependencies. Each instance ships the full repository context and the project's actual test suite, forcing agents to navigate real codebase structure rather than isolated code snippets.
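For concreteness, here is a minimal sketch of what one instance carries, assuming the dataset is the publicly released princeton-nlp/SWE-bench on Hugging Face; the field names follow the published SWE-bench schema, but swap the dataset ID if you use a mirror or the Lite/Verified subsets:

```python
# Minimal sketch: load and inspect one SWE-bench instance.
# Assumes the public Hugging Face release "princeton-nlp/SWE-bench".
from datasets import load_dataset

ds = load_dataset("princeton-nlp/SWE-bench", split="test")
inst = ds[0]

# Each instance pins a real repository state and a real issue:
print(inst["repo"])               # source GitHub repository
print(inst["base_commit"])        # commit to check out before patching
print(inst["problem_statement"])  # the unmodified issue text
print(inst["FAIL_TO_PASS"])       # tests the agent's patch must make pass
print(inst["PASS_TO_PASS"])       # tests that must keep passing
```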
vs others: More realistic than HumanEval or MBPP because it tests end-to-end issue resolution on production codebases rather than isolated function implementation, and more reproducible than ad-hoc evaluation because all 2,294 instances are version-controlled and standardized.
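The resolution criterion behind that end-to-end claim is easy to state: check out base_commit, apply the agent's patch, and require every FAIL_TO_PASS test to pass while every PASS_TO_PASS test keeps passing. The sketch below is a schematic of that scoring rule only, not the official harness (which runs each instance in a pinned per-repo environment); run_test here is a hypothetical pytest stand-in.

```python
# Schematic of SWE-bench's "resolved" rule; NOT the official harness,
# which executes each instance in a pinned per-repo environment.
import subprocess

def run_test(repo_dir: str, test_id: str) -> bool:
    # Hypothetical stand-in for the harness's per-repo test command;
    # many SWE-bench repos use pytest, but the real command varies.
    r = subprocess.run(["python", "-m", "pytest", test_id, "-x", "-q"],
                       cwd=repo_dir, capture_output=True)
    return r.returncode == 0

def resolved(repo_dir: str, model_patch: str,
             fail_to_pass: list[str], pass_to_pass: list[str]) -> bool:
    # Apply the agent's diff on top of the checked-out base_commit.
    apply = subprocess.run(["git", "apply", "-"], cwd=repo_dir,
                           input=model_patch, text=True)
    if apply.returncode != 0:
        return False  # an unappliable patch counts as unresolved
    # Resolved = the issue's failing tests now pass AND nothing regresses.
    return (all(run_test(repo_dir, t) for t in fail_to_pass)
            and all(run_test(repo_dir, t) for t in pass_to_pass))
```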