Capability
2 artifacts provide this capability.
Want a personalized recommendation?
Find the best match →via “difficulty-stratified problem categorization and filtering”
10K coding problems across 3 difficulty levels with test suites.
Unique: Explicitly stratifies problems into three difficulty tiers with substantial size per tier (3.6K, 5K, 1.4K), enabling fine-grained analysis of model performance degradation across skill levels rather than treating all problems as equal difficulty
vs others: Unlike HumanEval which lacks difficulty stratification, APPS enables researchers to measure whether models have genuine reasoning or are pattern-matching, by comparing performance across tiers
via “difficulty-stratified problem sampling and filtering”
12.5K competition math problems across 7 subjects and 5 difficulty levels.
Unique: Pre-assigned difficulty metadata (1-5 scale) from competition context enables efficient filtering without re-evaluation, unlike datasets where difficulty must be computed post-hoc. Difficulty labels are grounded in actual competition difficulty (AMC problems are easier, AIME problems are harder), providing meaningful stratification.
vs others: More efficient than datasets requiring dynamic difficulty estimation because filtering is O(1) lookup on metadata; more reliable than model-specific difficulty metrics because it uses competition-grounded labels that generalize across model architectures.
Building an AI tool with “Difficulty Stratified Problem Sampling And Filtering”?
Submit your artifact →curl unfragile.ai/agents.md | sh© 2026 Unfragile. The platform for software for agents.