Capability
Benchmark Suite Composition And Aggregation
5 artifacts provide this capability.
Top Matches
EleutherAI's evaluation framework: 200+ benchmarks, powers the Open LLM Leaderboard.
Unique: Provides a declarative suite definition system in which tasks can be grouped, each group with optional weights and an aggregation method. The system computes per-task and suite-level metrics automatically and propagates confidence intervals through aggregation. Supports both standard benchmarks (MMLU, BIG-bench) and custom suites defined in YAML or Python.
vs others: Supports weighted aggregation and custom suite composition, whereas alternatives typically report only per-task results. Suite definition is integrated into the evaluation framework rather than requiring external aggregation scripts.
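
To make the weighted aggregation idea concrete, here is a minimal, framework-independent sketch in Python. It is not the framework's actual API: `TaskResult`, `aggregate_suite`, the task names, and the scores are all hypothetical. It also assumes per-task estimates are independent and uses a normal approximation for the interval, which may differ from what the framework does internally.

```python
import math
from dataclasses import dataclass


@dataclass
class TaskResult:
    """Hypothetical container for one task's outcome (not a real framework type)."""
    name: str
    score: float         # per-task metric, e.g. accuracy in [0, 1]
    stderr: float        # standard error of that metric
    weight: float = 1.0  # optional suite weight


def aggregate_suite(results, z=1.96):
    """Weighted mean of task scores with a normal-approximation interval.

    Assumes the per-task estimates are independent, so the variance of the
    weighted mean is sum((w_i * se_i)^2) / (sum(w_i))^2.
    """
    total_weight = sum(r.weight for r in results)
    if total_weight <= 0:
        raise ValueError("suite weights must sum to a positive value")
    mean = sum(r.weight * r.score for r in results) / total_weight
    stderr = math.sqrt(sum((r.weight * r.stderr) ** 2 for r in results)) / total_weight
    return {
        "score": mean,
        "stderr": stderr,
        "ci": (mean - z * stderr, mean + z * stderr),
    }


# Illustrative suite with made-up numbers; task_a counts twice as much as task_b.
suite = [
    TaskResult("task_a", score=0.60, stderr=0.03, weight=2.0),
    TaskResult("task_b", score=0.80, stderr=0.04, weight=1.0),
]
summary = aggregate_suite(suite)
print(summary)
```

Per-task results remain available in `suite`, while `summary` carries the suite-level score and its interval, mirroring the per-task and suite-level split described above.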