AgentBench
A Comprehensive Benchmark to Evaluate LLMs as Agents (ICLR'24)
Capabilities (16 decomposed)
multi-environment llm agent evaluation across 8 standardized task domains
Medium confidence: Evaluates LLMs as autonomous agents across 8 distinct environments (OS, DB, KG, DCG, LTP, HH, WS, WB) using a standardized Task Interface that defines sample retrieval, execution, and metric calculation. The framework abstracts environment-specific logic behind a common contract, enabling systematic comparison of agent performance across heterogeneous task types with environment-specific startup times (5s-5min) and resource requirements (500MB-15GB). Agents interact with tasks through multi-turn Session management that tracks conversation history and message exchange.
First benchmark framework specifically designed for LLM agents (not just language tasks) with 8 diverse environments spanning command-line, database, knowledge graphs, games, and web interaction. Uses standardized Task Interface abstraction to enable environment-agnostic agent evaluation while preserving environment-specific metrics and startup characteristics.
Broader environment coverage than HELM (which focuses on language tasks) and more systematic than ad-hoc agent evaluation, with standardized interfaces enabling reproducible comparison across heterogeneous task domains.
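A minimal sketch of the evaluation loop this implies, assuming the three-method Task contract described below (get_indices, execute, get_metrics) and standing in for the Session with a plain turn log; the actual AgentBench API may differ:

```python
# Illustrative cross-environment evaluation loop. Method names follow the
# Task contract described on this page; everything else is an assumption.
def evaluate(agent, tasks: dict) -> dict:
    """Run the agent on every sample of every registered environment."""
    results = {}
    for name, task in tasks.items():          # e.g. "os", "db", "kg", ...
        sample_results = []
        for index in task.get_indices():      # environment picks its own samples
            session = []                       # stand-in for the Session turn log
            sample_results.append(task.execute(index, agent, session))
        results[name] = task.get_metrics(sample_results)  # env-specific scores
    return results
```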
standardized task interface for defining benchmark environments
Medium confidence: Provides a contract-based Task interface that all benchmark environments implement, defining methods for retrieving sample indices, executing individual samples with agent interactions, and calculating overall performance metrics. The interface abstracts environment-specific logic (game engines, database systems, web simulators) behind common method signatures, enabling the framework to orchestrate agent evaluation without coupling to particular environment implementations. Each task environment implements sample retrieval, step-by-step execution with agent actions, and metric aggregation.
Uses a minimal but comprehensive Task interface contract (get_indices, execute, get_metrics) that abstracts away environment-specific complexity while preserving the ability to implement domain-specific logic. Enables 8 diverse environments (game engines, databases, web simulators) to coexist under a single evaluation framework.
More flexible than monolithic benchmarks like GLUE (which hardcode specific tasks) because new environments can be added by implementing a single interface, not by modifying core evaluation logic.
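A hedged Python sketch of that contract as an abstract base class; the signatures are assumptions based on the description above, not the repository's exact code:

```python
# Sketch of the Task contract (get_indices, execute, get_metrics).
from abc import ABC, abstractmethod
from typing import Any, Dict, List


class Task(ABC):
    @abstractmethod
    def get_indices(self) -> List[int]:
        """Return the indices of the samples this environment exposes."""

    @abstractmethod
    def execute(self, index: int, agent: Any, session: Any) -> Dict[str, Any]:
        """Run one sample: feed observations to the agent via the session,
        apply its actions to the environment, return a per-sample result."""

    @abstractmethod
    def get_metrics(self, results: List[Dict[str, Any]]) -> Dict[str, float]:
        """Aggregate per-sample results into environment-specific scores."""
```

A new environment plugs in by subclassing this contract; the orchestration code never changes.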
web shopping task environment with e-commerce interaction simulation
Medium confidence: Provides a web shopping task environment where agents interact with a simulated e-commerce platform to complete shopping tasks (product search, comparison, purchase). Agents navigate product catalogs, read descriptions and reviews, manage shopping carts, and complete transactions through a web interface. The environment simulates realistic e-commerce workflows with product filtering, price comparison, and checkout processes. Tasks evaluate agent capabilities in information seeking, decision-making under uncertainty, and multi-step task completion in a complex web environment (~15GB resource requirement).
Integrates a full e-commerce simulation (WebShop-based) into AgentBench, enabling agents to complete realistic shopping tasks with product search, comparison, and purchase workflows. Agents must navigate complex web interfaces and make decisions based on product information and constraints.
More realistic than synthetic shopping tasks because it simulates actual e-commerce workflows with product catalogs and checkout processes, but more controlled than real websites due to simulation.
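WebShop-style interfaces typically expose a small textual action grammar (for example search[...] and click[...]). The parser below is an illustrative assumption about that grammar, not AgentBench's actual implementation:

```python
# Illustrative parser for WebShop-style action strings such as
# "search[red running shoes]" or "click[buy now]". The exact grammar in
# AgentBench's WS task may differ; treat this as an assumption.
import re

ACTION_RE = re.compile(r"^(search|click)\[(.+)\]$")


def parse_action(text: str):
    """Return (verb, argument) for a well-formed action, else None."""
    match = ACTION_RE.match(text.strip())
    if match is None:
        return None                      # malformed agent output
    return match.group(1), match.group(2)


assert parse_action("search[red running shoes]") == ("search", "red running shoes")
assert parse_action("click[buy now]") == ("click", "buy now")
```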
web browsing task environment with multi-page navigation and information retrieval
Medium confidence: Provides a web browsing task environment where agents navigate websites to find information and complete web-based tasks. Agents interact with a simulated web browser, following links, reading page content, and performing searches to locate specific information. The environment simulates realistic web navigation with multiple pages, search results, and information density variations. Tasks evaluate agent capabilities in web navigation, information retrieval, and multi-step task completion in open-ended web environments (~1GB resource requirement, ~5min startup).
Integrates a web browsing simulation (Mind2Web-based) into AgentBench, enabling agents to navigate multi-page websites and retrieve information through realistic web interactions. Agents must compose search queries, follow links, and extract relevant information from diverse page layouts.
More realistic than single-page information retrieval because it requires multi-step navigation and search, but more controlled than real web browsing due to simulation and limited page corpus.
household task environment with interactive home simulation (alfworld-based)
Medium confidence: Provides a household task environment where agents complete domestic tasks in a simulated home environment (based on ALFWorld). Agents interact with a text-based home simulator, manipulating objects, navigating rooms, and completing household chores (cooking, cleaning, organizing). The environment simulates realistic household physics and object interactions, requiring agents to reason about spatial relationships, object properties, and task decomposition. Tasks evaluate agent capabilities in embodied reasoning, multi-step task planning, and interactive problem-solving.
Integrates a household task simulation (ALFWorld-based) into AgentBench, enabling agents to complete domestic tasks requiring spatial reasoning, object manipulation, and multi-step planning. Agents must understand household physics and decompose complex chores into executable actions.
More embodied than text-only task planning because agents must reason about spatial relationships and object interactions, but more abstract than visual embodied AI because it uses text descriptions rather than images.
lateral thinking puzzle task environment with constraint-based reasoning
Medium confidence: Provides a lateral thinking puzzle task environment where agents solve puzzles requiring creative, non-linear reasoning and constraint satisfaction. Agents interact with a puzzle system that presents scenarios, accepts guesses/hypotheses, and provides feedback on correctness. The environment manages puzzle state, constraint tracking, and solution validation. Tasks evaluate agent capabilities in creative problem-solving, hypothesis generation, constraint reasoning, and iterative refinement. Agents must think beyond obvious solutions and reason about implicit constraints.
Provides a lateral thinking puzzle environment that tests agent capabilities in creative, non-linear reasoning and constraint satisfaction. Puzzles require agents to think beyond obvious solutions and reason about implicit constraints, testing higher-order reasoning.
More challenging than standard reasoning benchmarks because lateral thinking puzzles require creative hypothesis generation and constraint reasoning, not just logical deduction.
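A toy sketch of the guess-and-feedback loop described above. The agent_ask callable, the keyword-matching host, and the "solution:" convention are all invented for illustration, not the benchmark's real protocol:

```python
# Toy lateral-thinking-puzzle loop: the agent asks questions or proposes a
# solution; the host answers and the loop ends when the solution matches.
def run_puzzle(agent_ask, scenario: str, truth: str, max_turns: int = 10) -> bool:
    """agent_ask(scenario, transcript) -> a question or a 'solution: ...' string."""
    transcript = []
    for _ in range(max_turns):
        question = agent_ask(scenario, transcript)
        if question.lower().startswith("solution:"):
            guess = question.split(":", 1)[1].strip().lower()
            return guess in truth.lower()          # solved vs. not solved
        # toy host: say "yes" if any word of the question appears in the truth
        answer = "yes" if any(w in truth.lower() for w in question.lower().split()) else "no"
        transcript.append((question, answer))
    return False
```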
digital card game task environment with strategic decision-making
Medium confidence: Provides a digital card game task environment where agents play strategic card games requiring decision-making, resource management, and opponent modeling. Agents receive game state information (hand, board, opponent state), select actions (play cards, attack, defend), and observe game outcomes. The environment manages game rules, turn order, win conditions, and card interactions. Tasks evaluate agent capabilities in strategic reasoning, resource optimization, and decision-making under uncertainty. Agents must balance multiple objectives and adapt strategies based on game state.
Provides a digital card game environment that tests agent capabilities in strategic reasoning, resource management, and decision-making under uncertainty. Agents must evaluate multiple card options and adapt strategies based on evolving game state.
More complex than simple turn-based games because card games introduce resource constraints, card interactions, and strategic depth, testing more sophisticated reasoning than single-action decisions.
configuration-driven task and agent setup with yaml/json specifications
Medium confidence: Provides a configuration system that enables users to define task environments, agent parameters, and evaluation assignments through YAML or JSON configuration files. The configuration system abstracts away code-level customization, enabling non-developers to set up benchmarks by editing configuration files. Supports task-specific parameters (environment type, sample count, resource limits), agent-specific parameters (model, temperature, prompt template), and assignment-level parameters (worker count, timeout). Configuration validation ensures correctness before execution.
Provides a configuration-driven setup system that separates benchmark specification from code, enabling non-developers to set up evaluations and researchers to share reproducible configurations. Supports task, agent, and assignment-level configuration.
More accessible than code-based setup because configuration files are human-readable and don't require programming knowledge, but less flexible than programmatic APIs for advanced customization.
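A hypothetical configuration in this spirit, parsed and minimally validated in Python; the field names (module, sample_count, workers, and so on) are illustrative assumptions, not AgentBench's schema:

```python
# Hypothetical three-section config: task, agent, assignment.
import yaml  # requires PyYAML (pip install pyyaml)

CONFIG = """
task:
  module: dbbench
  sample_count: 100
agent:
  model: gpt-4
  temperature: 0.0
assignment:
  workers: 4
  timeout_seconds: 600
"""

config = yaml.safe_load(CONFIG)

# Minimal validation before launching an evaluation run.
for section in ("task", "agent", "assignment"):
    assert section in config, f"missing '{section}' section"
assert config["assignment"]["workers"] >= 1
```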
agent interface with standardized decision-making and session communication
Medium confidence: Defines a standardized Agent interface that abstracts how LLMs and other decision-makers interact with task environments through a Session communication channel. Agents receive observations from tasks, generate actions, and receive feedback in a multi-turn loop. The interface supports both sophisticated LLM-based agents (with prompt engineering, chain-of-thought reasoning) and naive rule-based agents, enabling comparison of different agent architectures. Session management tracks conversation history and message exchange, providing agents with context for decision-making.
Provides a unified Agent interface that supports both LLM-based agents (with arbitrary prompt engineering and reasoning strategies) and naive baseline agents, enabling architectural comparison. Session management preserves conversation history, allowing agents to leverage multi-turn context for improved decision-making.
More general than task-specific agent implementations because the same Agent interface works across all 8 environments without modification, unlike custom agent code per task.
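A hedged sketch of such an interface, with a trivial rule-based baseline alongside it; the inference method name and the history format are assumptions based on the description above:

```python
# Sketch of an Agent contract: one method mapping history -> next action.
from abc import ABC, abstractmethod
from typing import Dict, List


class Agent(ABC):
    @abstractmethod
    def inference(self, history: List[Dict[str, str]]) -> str:
        """Given the observation/action history, return the next action text."""


class EchoAgent(Agent):
    """Naive baseline: repeat the last observation (useful as a floor)."""
    def inference(self, history):
        return history[-1]["content"] if history else ""
```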
session-based multi-turn conversation management between agents and tasks
Medium confidence: Implements a Session abstraction that manages the communication channel between agents and task environments, handling message exchange, conversation history tracking, and state synchronization across multiple turns. Sessions maintain a chronological record of agent observations, actions, and task feedback, enabling agents to make decisions based on accumulated context. The Session interface standardizes how agents receive observations and submit actions, decoupling agent logic from environment-specific communication protocols.
Provides a lightweight Session abstraction that decouples conversation management from environment-specific logic, enabling agents to interact with heterogeneous environments (databases, games, web) through a unified message-passing interface. Preserves full conversation history for post-hoc analysis.
Simpler than full dialogue state tracking systems (like DSTC) because it doesn't require semantic slot extraction, just message sequencing and history preservation.
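A minimal sketch of a Session along these lines; the inject/action method names and role labels are assumptions, not the framework's exact API:

```python
# Sketch of a Session: an append-only turn log plus the two calls an
# environment needs -- show the agent something, then get its reply.
from typing import Callable, Dict, List


class Session:
    def __init__(self, agent_fn: Callable[[List[Dict[str, str]]], str]):
        self.agent_fn = agent_fn                 # e.g. an LLM call
        self.history: List[Dict[str, str]] = []  # chronological turn log

    def inject(self, content: str, role: str = "user") -> None:
        """Record an observation or instruction coming from the environment."""
        self.history.append({"role": role, "content": content})

    def action(self) -> str:
        """Ask the agent for its next move given the full history so far."""
        reply = self.agent_fn(self.history)
        self.history.append({"role": "agent", "content": reply})
        return reply
```

In this shape an environment calls session.inject(observation) then session.action() each turn, and the complete history remains available afterwards for post-hoc analysis.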
task controller orchestration with distributed task execution and resource management
Medium confidence: Implements a Task Controller that orchestrates the execution of benchmark tasks across multiple workers, managing resource allocation, task assignment, and result aggregation. The controller uses a Task Assigner to distribute samples across workers and a pool of Task Workers to execute agent-task interactions in parallel. This architecture enables efficient evaluation of agents across large sample sets while managing system resources (memory, CPU, disk) and handling task startup/teardown. The controller coordinates the lifecycle of task environments (initialization, sample execution, metric calculation, cleanup).
Uses a Task Controller + Task Assigner + Task Workers pattern to distribute benchmark evaluation across multiple processes while managing heterogeneous task startup times (5s-5min) and resource requirements (500MB-15GB). Abstracts away parallelization complexity from task and agent implementations.
More sophisticated than sequential evaluation because it amortizes task startup overhead across multiple samples and enables parallel execution, but simpler than full distributed systems (no network communication, single-machine focus).
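A rough sketch of the assigner/worker split using a process pool, assuming picklable task factories and agents; the point is that each worker pays the environment startup cost once and then reuses the instance across samples:

```python
# Distribute sample indices to worker processes that each own a long-lived
# environment instance, amortizing the 5s-5min startup cost. Names are
# illustrative, not AgentBench's actual controller code.
from concurrent.futures import ProcessPoolExecutor

_worker_task = None  # per-process environment instance


def _init_worker(task_factory):
    """Start the environment once per worker process (the expensive step)."""
    global _worker_task
    _worker_task = task_factory()


def _run_sample(args):
    index, agent = args
    session = []                                  # stand-in Session turn log
    return _worker_task.execute(index, agent, session)


def run_assignment(task_factory, agent, indices, workers: int = 4):
    with ProcessPoolExecutor(max_workers=workers,
                             initializer=_init_worker,
                             initargs=(task_factory,)) as pool:
        return list(pool.map(_run_sample, [(i, agent) for i in indices]))
```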
environment-specific metric calculation and performance aggregation
Medium confidence: Provides a standardized Evaluation Metrics subsystem where each task environment implements domain-specific metric calculation (e.g., success rate for games, SQL correctness for databases, task completion for household tasks). The framework aggregates per-sample metrics into overall performance scores while preserving environment-specific semantics. Metrics are calculated after task execution completes, enabling post-hoc analysis and comparison across agents. The metric interface supports both binary success indicators and continuous performance scores.
Implements environment-specific metric calculation that preserves domain semantics (e.g., game win rate, SQL query correctness, household task completion) rather than forcing all tasks into a single metric space. Enables meaningful performance comparison within each domain while acknowledging that cross-domain comparison requires careful interpretation.
More nuanced than single-metric benchmarks (like GLUE's average score) because it respects the different success criteria across diverse task types, but requires more sophisticated analysis to compare across domains.
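A small illustration of why per-environment metrics stay separate: each aggregator returns domain-specific fields, and the framework only collects them side by side. Function and field names here are made up for the example:

```python
# Per-environment aggregation: no forced single metric across domains.
from typing import Dict, List


def os_metrics(results: List[dict]) -> Dict[str, float]:
    return {"success_rate": sum(r["success"] for r in results) / len(results)}


def db_metrics(results: List[dict]) -> Dict[str, float]:
    # e.g. exact-match correctness of the final query answer
    return {"answer_accuracy": sum(r["correct"] for r in results) / len(results)}


overall = {
    "os": os_metrics([{"success": True}, {"success": False}]),
    "db": db_metrics([{"correct": True}, {"correct": True}]),
}
# overall == {"os": {"success_rate": 0.5}, "db": {"answer_accuracy": 1.0}}
```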
avalon game environment with strategic multi-agent gameplay simulation
Medium confidence: Implements a complete Avalon game environment where LLM agents play a social deduction game requiring strategic reasoning, communication, and deception detection. The environment includes a game engine that manages game state, turn order, voting mechanics, and win conditions, while agents interact through natural language communication and action selection. The Avalon task evaluates agent capabilities in multi-agent strategic reasoning, persuasion, and information inference from incomplete information. Agents must balance exploration (gathering information) with exploitation (making winning moves).
Provides a complete Avalon game engine integrated into AgentBench, enabling evaluation of LLM agents in a complex multi-agent strategic environment with hidden information, voting mechanics, and social deduction elements. Agents must reason about other players' strategies and communicate persuasively.
More sophisticated than simple turn-based games because Avalon requires reasoning about hidden information and other agents' beliefs, testing higher-order reasoning capabilities than single-player tasks.
operating system command-line task environment with linux shell interaction
Medium confidence: Provides a Linux OS command-line task environment where agents interact with a shell interface to complete system administration and file manipulation tasks. Agents receive shell prompts, issue commands, and observe command output in a multi-turn interaction loop. The environment manages a sandboxed Linux filesystem and command execution, enabling safe evaluation of agent capabilities in command-line reasoning and system administration. Tasks include file operations, text processing, system queries, and scripting.
Integrates a sandboxed Linux shell environment into AgentBench, enabling agents to interact with real command-line interfaces while maintaining safety through filesystem isolation. Agents must reason about command syntax, output interpretation, and multi-step task decomposition.
More realistic than synthetic command-line simulators because it uses actual Linux shells and commands, but more controlled than unrestricted system access due to sandboxing.
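An illustrative multi-turn shell loop; a real setup executes inside an isolated container, and the DONE convention, transcript format, and agent_fn callable here are assumptions for the sketch:

```python
# Toy OS-task loop: the agent sees the instruction plus prior command output
# and emits the next shell command until it signals completion.
import subprocess


def run_os_task(agent_fn, instruction: str, max_turns: int = 5) -> str:
    transcript = [("task", instruction)]
    for _ in range(max_turns):
        command = agent_fn(transcript)            # e.g. "ls -1 | wc -l"
        if command.strip() == "DONE":
            break
        proc = subprocess.run(command, shell=True, capture_output=True,
                              text=True, timeout=30)
        transcript.append(("shell", proc.stdout + proc.stderr))
    return transcript[-1][1]                      # last observation seen
```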
database sql query task environment with schema-aware interaction
Medium confidence: Provides a database task environment where agents interact with SQL databases to complete data querying and manipulation tasks. Agents receive database schemas, issue SQL queries, and observe query results in a multi-turn loop. The environment manages a sandboxed database instance with predefined schemas and data, enabling evaluation of agent capabilities in SQL reasoning, schema understanding, and query composition. Tasks include data retrieval, aggregation, filtering, and complex joins.
Integrates a sandboxed database environment into AgentBench with schema-aware interaction, enabling agents to reason about relational structures and compose SQL queries. Agents must understand database semantics and handle SQL errors gracefully.
More realistic than text-based SQL reasoning tasks because agents interact with actual database systems and receive real query results, but more controlled than production databases due to sandboxing and predefined schemas.
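An illustrative database turn using SQLite in place of the benchmark's sandboxed database instance: the agent's SQL is executed and either the result rows or the error text come back as its next observation. The table and data are invented for the example:

```python
# Toy DB-task turn: execute agent SQL, return rows or an error message.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (id INTEGER, amount REAL);
INSERT INTO orders VALUES (1, 20.0), (2, 5.5);
""")


def run_sql(query: str) -> str:
    """Execute agent-produced SQL and return results or the error text."""
    try:
        rows = conn.execute(query).fetchall()
        return str(rows)
    except sqlite3.Error as exc:
        return f"ERROR: {exc}"                    # fed back so the agent can retry


print(run_sql("SELECT SUM(amount) FROM orders"))  # [(25.5,)]
```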
knowledge graph querying and reasoning task environment
Medium confidence: Provides a knowledge graph task environment where agents query and reason over structured knowledge representations to answer questions and complete reasoning tasks. Agents interact with a knowledge graph API, issuing queries to retrieve entities, relationships, and perform multi-hop reasoning. The environment manages a sandboxed knowledge graph with predefined entities and relationships, enabling evaluation of agent capabilities in semantic reasoning, relationship inference, and multi-step knowledge navigation. Tasks include entity lookup, relationship discovery, and transitive reasoning.
Integrates a knowledge graph environment into AgentBench, enabling agents to perform multi-hop reasoning and semantic inference over structured knowledge. Agents must navigate entity-relationship structures and compose multi-step reasoning chains.
More structured than free-text QA tasks because knowledge graphs provide explicit relationships, but more challenging than single-hop lookups because agents must reason across multiple hops.
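A toy two-hop lookup over an in-memory graph, standing in for whatever query API the KG task actually exposes; the entities and relations are invented for the example:

```python
# Toy knowledge graph and a two-hop query, illustrating multi-hop reasoning.
GRAPH = {
    ("Ada Lovelace", "collaborated_with"): ["Charles Babbage"],
    ("Charles Babbage", "designed"): ["Analytical Engine"],
}


def hop(entity: str, relation: str) -> list:
    return GRAPH.get((entity, relation), [])


# "What did the person Ada Lovelace collaborated with design?"
step1 = hop("Ada Lovelace", "collaborated_with")          # ["Charles Babbage"]
step2 = [t for e in step1 for t in hop(e, "designed")]     # ["Analytical Engine"]
print(step2)
```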
Capabilities are decomposed by AI analysis. Each maps to specific user intents and improves with match feedback.
Related Artifacts sharing capabilities
Artifacts that share capabilities with AgentBench, ranked by overlap. Discovered automatically through the match graph.
WebArena
Realistic web environment for autonomous agent testing.
ReAct: Synergizing Reasoning and Acting in Language Models (ReAct)
Prompting paradigm that interleaves reasoning traces with actions; a common agent baseline.
OSWorld
Real OS benchmark for multimodal computer agents.
HELM
Stanford's holistic LLM evaluation — 42 scenarios, 7 metrics including fairness, bias, toxicity.
TaskWeaver
The first "code-first" agent framework for seamlessly planning and executing data analytics tasks.
Best For
- ✓LLM researchers evaluating agent capabilities across diverse domains
- ✓Teams comparing proprietary vs open-source LLM performance as agents
- ✓Organizations building agent systems and needing standardized benchmarks
- ✓Researchers extending AgentBench with custom task environments
- ✓Framework maintainers ensuring consistency across 8+ task implementations
- ✓Teams building domain-specific agent benchmarks using AgentBench patterns
- ✓Researchers evaluating LLM agents' web navigation and decision-making capabilities
- ✓Teams testing agent e-commerce task completion and information seeking
Known Limitations
- ⚠Web Shopping and Web Browsing environments require 15GB and 1GB respectively, limiting local evaluation
- ⚠Startup times vary significantly (5s-5min), making batch evaluation of many samples time-intensive
- ⚠Environment-specific metrics are not directly comparable across domains, requiring separate analysis per task type
- ⚠No built-in support for custom evaluation metrics beyond environment-provided ones
- ⚠Interface abstraction may hide important environment-specific details, requiring documentation per task
- ⚠Metric calculation must be implemented per-environment, preventing cross-environment metric comparison
Repository Details
Last commit: Feb 8, 2026
About
A Comprehensive Benchmark to Evaluate LLMs as Agents (ICLR'24)