Capability
Benchmark Dataset For Dialogue Model Evaluation
12 artifacts provide this capability.
Top Matches
via “large-scale benchmark dataset with 44K examples”
44K pronoun resolution problems testing commonsense understanding.
Unique: Scales to 44,000 examples (vs. 273 in the original Winograd Schema Challenge) while maintaining adversarial filtering, enabling statistically robust model comparison and detection of small performance differences that would be noise in smaller benchmarks.
vs. others: Larger than the original Winograd Schema Challenge (273 examples), enabling tighter confidence intervals; smaller than full coreference datasets (OntoNotes, ~3.6M tokens) but more focused on commonsense reasoning than general coreference.
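The claim about tighter confidence intervals can be made concrete with a quick back-of-the-envelope calculation. This sketch uses the standard normal approximation for a binomial proportion (the accuracy figure of 70% is a hypothetical example, not from the benchmark itself) to compare the margin of error at 273 vs. 44,000 examples:

```python
import math

def ci_half_width(accuracy: float, n: int, z: float = 1.96) -> float:
    """95% normal-approximation confidence-interval half-width
    for an accuracy estimate computed over n benchmark examples."""
    return z * math.sqrt(accuracy * (1 - accuracy) / n)

# A hypothetical model scoring 70% accuracy:
small = ci_half_width(0.70, 273)     # original Winograd Schema Challenge size
large = ci_half_width(0.70, 44000)   # this benchmark's size

print(f"n=273:   +/- {small:.1%}")   # roughly +/- 5.4 points
print(f"n=44000: +/- {large:.1%}")   # roughly +/- 0.4 points
```

At 273 examples, a 2-point accuracy gap between two models is well inside the noise band; at 44,000 examples it is clearly resolvable, which is the practical payoff of the larger size.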