AgentBench | Benchmark | 40/100 via “multi-environment agent evaluation with standardized task interface”
8-environment benchmark for evaluating LLM agents.
Unique: First benchmark framework specifically designed for LLM agents with 8 diverse task environments spanning web, database, OS, and game domains. Uses a unified Task interface abstraction that allows heterogeneous environments (WebShop, Mind2Web, ALFWorld, custom games) to expose consistent sample/execute/metric APIs, enabling apples-to-apples agent comparison across fundamentally different interaction paradigms.
vs others: Broader environmental coverage than single-domain benchmarks (e.g., WebShop-only or OS-only) and more realistic than purely synthetic task collections, giving a more comprehensive picture of agent capability across real-world interaction styles.
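The unified Task abstraction described above can be sketched as a small interface plus a shared evaluation harness. This is a hypothetical illustration in the spirit of the design, not AgentBench's actual API; the class and method names (`Task`, `sample`, `execute`, `metric`) follow the description in this entry but are assumptions.

```python
from abc import ABC, abstractmethod
from typing import Any, Callable


class Task(ABC):
    """Hypothetical unified interface: every environment, however different
    internally, exposes the same sample/execute/metric surface."""

    @abstractmethod
    def sample(self) -> dict:
        """Return one task instance (prompt plus hidden ground truth)."""

    @abstractmethod
    def execute(self, instance: dict, action: str) -> Any:
        """Apply the agent's action in the environment; return an outcome."""

    @abstractmethod
    def metric(self, instance: dict, outcome: Any) -> float:
        """Score the outcome on a common scale (here [0, 1]) so that
        scores are comparable across environments."""


# Two toy environments behind the same interface (stand-ins for, say,
# a database task and a game task).
class ToyDBTask(Task):
    def sample(self) -> dict:
        return {"prompt": "How many users are in the table?", "answer": "42"}

    def execute(self, instance: dict, action: str) -> str:
        return action.strip()

    def metric(self, instance: dict, outcome: str) -> float:
        return 1.0 if outcome == instance["answer"] else 0.0


class ToyGameTask(Task):
    def sample(self) -> dict:
        return {"prompt": "Guess the secret number between 1 and 10.", "answer": "7"}

    def execute(self, instance: dict, action: str) -> str:
        return action.strip()

    def metric(self, instance: dict, outcome: str) -> float:
        return 1.0 if outcome == instance["answer"] else 0.0


def evaluate(task: Task, agent: Callable[[str], str]) -> float:
    """One harness evaluates any environment: this is what enables the
    apples-to-apples comparison across interaction paradigms."""
    instance = task.sample()
    outcome = task.execute(instance, agent(instance["prompt"]))
    return task.metric(instance, outcome)


# A trivial "agent" that always answers "42" scores 1.0 on the DB task
# and 0.0 on the game task, through the identical evaluation path.
score_db = evaluate(ToyDBTask(), lambda prompt: "42")
score_game = evaluate(ToyGameTask(), lambda prompt: "42")
```

The key design point is that the harness (`evaluate`) never branches on environment type; heterogeneity lives entirely behind the interface.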