Capability
Human Performance Anchored Difficulty Calibration
3 artifacts provide this capability.
44K pronoun resolution problems testing commonsense understanding.
Unique: Establishes 94% human performance, measured via expert annotation, as an explicit calibration anchor, enabling quantitative model-human comparison rather than abstract performance claims. The anchor is embedded in the dataset metadata and evaluation harnesses.
vs others: More interpretable than relative benchmarks (e.g., "better than GPT-3") because human performance provides an absolute reference point, and more rigorous than datasets without human baselines, where model performance claims lack grounding.
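A minimal sketch of what anchored reporting might look like in practice: model accuracy is reported alongside the human-performance anchor rather than in isolation. The function name, field names, and example numbers below are illustrative assumptions, not any specific dataset's actual schema or results.

```python
# Sketch of human-anchored evaluation reporting: compare model accuracy
# against an explicit human-performance anchor (e.g., one stored in
# dataset metadata). All names and figures here are hypothetical.

def anchored_report(correct: int, total: int, human_accuracy: float) -> dict:
    """Return model accuracy alongside the human anchor and the gap to it."""
    model_accuracy = correct / total
    return {
        "model_accuracy": round(model_accuracy, 4),
        "human_accuracy": human_accuracy,
        # Absolute gap: positive means the model is below the human anchor.
        "gap_to_human": round(human_accuracy - model_accuracy, 4),
        # Fraction of the human anchor achieved; > 1.0 would be superhuman.
        "relative_to_human": round(model_accuracy / human_accuracy, 4),
    }

if __name__ == "__main__":
    # Illustrative run: a model answers 39,160 of 44,000 items correctly,
    # evaluated against a 94% human-performance anchor.
    print(anchored_report(correct=39160, total=44000, human_accuracy=0.94))
```

Reporting `relative_to_human` alongside raw accuracy is what makes the anchor actionable: a reader sees at a glance how far a model is from the human reference point instead of a free-floating percentage.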