Capability
Human Performance Anchored Difficulty Calibration
3 artifacts provide this capability.
44K pronoun resolution problems testing commonsense understanding.
Unique: Establishes 94% human performance, measured via expert annotation, as an explicit calibration anchor, enabling quantitative model-human comparison rather than abstract performance claims. The anchor is embedded in the dataset metadata and evaluation harnesses.
vs others: More interpretable than relative benchmarks (e.g., "better than GPT-3") because human performance provides an absolute reference point, and more rigorous than datasets without human baselines, where model performance claims lack grounding.
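A minimal sketch of what anchored reporting might look like in practice: model accuracy is reported alongside the human-performance anchor rather than in isolation. The function name, field names, and example numbers below are illustrative assumptions, not any specific dataset's actual schema or results.

```python
# Sketch of human-anchored evaluation reporting: compare model accuracy
# against an explicit human-performance anchor (e.g., one stored in
# dataset metadata). All names and figures here are hypothetical.

def anchored_report(correct: int, total: int, human_accuracy: float) -> dict:
    """Return model accuracy alongside the human anchor and the gap to it."""
    model_accuracy = correct / total
    return {
        "model_accuracy": round(model_accuracy, 4),
        "human_accuracy": human_accuracy,
        # Absolute gap: positive means the model is below the human anchor.
        "gap_to_human": round(human_accuracy - model_accuracy, 4),
        # Fraction of the human anchor achieved; > 1.0 would be superhuman.
        "relative_to_human": round(model_accuracy / human_accuracy, 4),
    }

if __name__ == "__main__":
    # Illustrative run: a model answers 39,160 of 44,000 items correctly,
    # evaluated against a 94% human-performance anchor.
    print(anchored_report(correct=39160, total=44000, human_accuracy=0.94))
```

Reporting `relative_to_human` alongside raw accuracy is what makes the anchor actionable: a reader sees at a glance how far a model is from the human reference point instead of a free-floating percentage.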