Issue Difficulty Classification And Stratification

1

SWE-bench · Benchmark · 63/100

AI coding agent benchmark — real GitHub issues, end-to-end evaluation, the standard for code agents.

Unique: Automatically classifies instance difficulty based on objective metrics (lines changed, files modified) rather than manual annotation, enabling scalable stratification without human effort. This allows analysis of agent performance across difficulty levels without requiring subjective difficulty labels.

vs others: More scalable than manual difficulty annotation because it uses objective metrics, and more nuanced than single aggregate metrics because it reveals how agent performance varies with problem complexity.
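As a rough illustration of metric-based classification, the sketch below buckets an instance by patch size. The `Instance` fields, the line cutoffs (15 and 50), and the multi-file bump are hypothetical choices made for this example; they are not thresholds published by SWE-bench.

```python
from dataclasses import dataclass


@dataclass
class Instance:
    # Hypothetical record; field names are assumptions for illustration.
    instance_id: str
    lines_changed: int
    files_modified: int


def classify_difficulty(inst, line_cutoffs=(15, 50), multi_file_bump=True):
    """Return 'easy', 'medium', or 'hard' from objective patch metrics.

    The cutoffs are illustrative assumptions, not official values.
    """
    low, high = line_cutoffs
    if inst.lines_changed <= low:
        level = 0
    elif inst.lines_changed <= high:
        level = 1
    else:
        level = 2
    # A patch touching several files is treated as one tier harder (capped at 'hard').
    if multi_file_bump and inst.files_modified > 1:
        level = min(level + 1, 2)
    return ("easy", "medium", "hard")[level]


if __name__ == "__main__":
    for inst in (Instance("a", 5, 1), Instance("b", 5, 3), Instance("d", 200, 4)):
        print(inst.instance_id, classify_difficulty(inst))
```

Because the rule depends only on the reference patch, it can be applied to every instance without human labeling, which is the scalability argument made above.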

2

MATH Benchmark · Benchmark · 63/100

via “problem difficulty level annotation and stratification”

12.5K competition math problems — AMC/AIME/Olympiad level, 7 subjects, standard math benchmark.

Unique: Provides difficulty stratification based on official competition sources (AMC 10 is easier than AIME, which in turn is easier than Olympiad), enabling researchers to analyze whether a model's reasoning holds up as problem difficulty increases. This reveals whether models reason robustly or have merely memorized easy problem patterns.

vs others: More principled than arbitrary difficulty scoring because it leverages established competition hierarchies, but less precise than learned difficulty metrics based on empirical model performance data.

3

APPS (Automated Programming Progress Standard) · Dataset · 56/100

via “difficulty-stratified problem categorization and filtering”

10K coding problems across 3 difficulty levels with test suites.

Unique: Explicitly stratifies problems into three difficulty tiers (introductory, interview, competition) with substantial size per tier (3.6K, 5K, 1.4K), enabling analysis of how model performance degrades across skill levels rather than treating all problems as equally difficult.

vs others: Unlike HumanEval, which lacks difficulty stratification, APPS lets researchers compare performance across tiers to gauge whether models reason or merely pattern-match.
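The cross-tier comparison described above amounts to grouping results by tier and computing a pass rate for each. A minimal sketch follows, assuming results arrive as `(tier, passed)` pairs; that input shape and the function name are hypothetical, not part of the APPS tooling.

```python
from collections import defaultdict


def pass_rate_by_tier(results):
    """Map each difficulty tier to the fraction of its problems that passed.

    `results` is an iterable of (tier, passed) pairs; this shape is an
    assumption made for illustration.
    """
    totals = defaultdict(int)
    passes = defaultdict(int)
    for tier, passed in results:
        totals[tier] += 1
        passes[tier] += int(bool(passed))
    return {tier: passes[tier] / totals[tier] for tier in totals}


if __name__ == "__main__":
    # Made-up outcomes, only to show the call shape.
    sample = [
        ("introductory", True),
        ("introductory", False),
        ("interview", False),
        ("competition", False),
    ]
    print(pass_rate_by_tier(sample))
```

A steep drop in pass rate from one tier to the next is the kind of degradation that a single aggregate score would hide.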
