Capability

Benchmark Dataset For Dialogue Model Evaluation

12 artifacts provide this capability.

Top Matches

via “large-scale benchmark dataset with 44k examples”

44K pronoun resolution problems testing commonsense understanding.

Unique: Scales to 44,000 examples (vs. 273 in the original Winograd Schema Challenge) while maintaining adversarial filtering. The larger sample supports statistically robust model comparison and makes it possible to detect small performance differences that would be lost in the noise of a smaller benchmark.

vs others: Larger than the original Winograd Schema Challenge (273 examples), which allows tighter confidence intervals. Smaller than full coreference datasets (OntoNotes, ~3.6M tokens), but focused on commonsense reasoning rather than general coreference.
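To make the confidence-interval claim concrete, here is a rough sketch of how the width of an accuracy confidence interval shrinks with benchmark size. It assumes a simple normal approximation to the binomial, independent examples, and that every example is used for evaluation; the function name `ci_half_width` and the 50% accuracy figure are illustrative choices, not part of any benchmark's tooling.

```python
import math


def ci_half_width(accuracy: float, n_examples: int, z: float = 1.96) -> float:
    """Half-width of a normal-approximation confidence interval for accuracy.

    accuracy   -- observed fraction correct, between 0 and 1
    n_examples -- number of evaluation examples
    z          -- critical value (1.96 corresponds to roughly 95% coverage)
    """
    if n_examples <= 0:
        raise ValueError("n_examples must be positive")
    if not 0.0 <= accuracy <= 1.0:
        raise ValueError("accuracy must be between 0 and 1")
    standard_error = math.sqrt(accuracy * (1.0 - accuracy) / n_examples)
    return z * standard_error


if __name__ == "__main__":
    # Illustrative comparison at 50% accuracy, the worst case for variance.
    for n in (273, 44_000):
        width = ci_half_width(0.5, n)
        print(f"n={n}: accuracy +/- {100 * width:.2f} percentage points")
```

Under these assumptions, the interval at 50% accuracy is about 6 percentage points either side with 273 examples and about half a percentage point with 44,000, so a one-point gap between two models is only distinguishable on the larger set.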
