via “model calibration measurement with multiple metrics and binning strategies”
57-subject knowledge benchmark — 15K+ questions across STEM, humanities, professional domains.
Unique: Implements five calibration metrics (ECE, SCE, RMSCE, ACE, TACE) with pluggable binning strategies (uniform, adaptive) and normalization methods. Because binning and normalization are kept separate from the metric logic, researchers can try different calibration definitions without reimplementing the core computation.
vs others: Single-metric approaches (e.g., ECE only) report one number under one binning scheme. Offering several metrics and binning strategies gives a more nuanced view of how reliable a model's confidence is and can reveal calibration issues that a single metric might miss.
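
To make the idea of pluggable binning and normalization concrete, here is a minimal sketch in Python with NumPy. It is not this tool's API: the function names `bin_edges` and `calibration_error` and the parameters `strategy` and `norm` are hypothetical, chosen for illustration. It covers only top-label confidence with an L1 (ECE-style) or L2 (RMS-style) norm, and leaves out the class-wise and thresholded variants.

```python
import numpy as np


def bin_edges(confidences, n_bins, strategy="uniform"):
    """Return n_bins + 1 bin edges over [0, 1] (hypothetical helper)."""
    if strategy == "uniform":
        # Equal-width bins.
        return np.linspace(0.0, 1.0, n_bins + 1)
    if strategy == "adaptive":
        # Equal-mass bins: edges sit at quantiles of the observed confidences.
        edges = np.quantile(confidences, np.linspace(0.0, 1.0, n_bins + 1))
        edges[0], edges[-1] = 0.0, 1.0
        return edges
    raise ValueError(f"unknown binning strategy: {strategy!r}")


def calibration_error(confidences, correct, n_bins=10,
                      strategy="uniform", norm="l1"):
    """Weighted gap between accuracy and mean confidence per bin.

    norm="l1" gives an ECE-style score; norm="l2" gives an RMS-style score.
    """
    if norm not in ("l1", "l2"):
        raise ValueError(f"unknown norm: {norm!r}")
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = bin_edges(confidences, n_bins, strategy)
    # Interior edges only, so bin indices fall in 0 .. n_bins - 1.
    idx = np.digitize(confidences, edges[1:-1], right=True)

    total = 0.0
    for b in range(n_bins):
        mask = idx == b
        if not mask.any():
            continue  # empty bins contribute nothing
        gap = abs(correct[mask].mean() - confidences[mask].mean())
        weight = mask.mean()  # fraction of samples in this bin
        total += weight * (gap if norm == "l1" else gap ** 2)
    return float(total if norm == "l1" else np.sqrt(total))


if __name__ == "__main__":
    conf = [0.9, 0.9, 0.9, 0.9, 0.6, 0.6]
    hit = [1, 1, 1, 0, 1, 0]
    for strategy in ("uniform", "adaptive"):
        for norm in ("l1", "l2"):
            score = calibration_error(conf, hit, n_bins=10,
                                      strategy=strategy, norm=norm)
            print(strategy, norm, round(score, 4))
```

Running the same predictions through each combination shows how the reported score can shift with the binning strategy and the norm, which is the reason for not relying on a single configuration.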