OSWorld
Benchmark · Free
Real OS benchmark for multimodal computer agents.
Capabilities (12 decomposed)
real-environment gui interaction evaluation
Medium confidence
Evaluates multimodal agents' ability to interact with actual operating system graphical interfaces across Ubuntu, Windows, and macOS by executing tasks that require screenshot understanding, mouse/keyboard simulation, and application navigation. Uses custom execution-based evaluation scripts per task that capture initial OS state, execute agent actions, and verify task completion against ground truth outcomes in real sandboxed environments.
Executes tasks on actual operating systems (Ubuntu, Windows, macOS) with custom per-task evaluation scripts rather than simulated environments or synthetic UI frameworks. Grounds agent evaluation in real application behavior, file I/O, and OS-level state changes, capturing the complexity of multi-app workflows and GUI grounding that synthetic benchmarks cannot replicate.
More realistic than simulated GUI benchmarks (e.g., WebShop, MiniWoB) because it tests against actual OS behavior and real applications, but requires significantly more computational infrastructure than synthetic alternatives, making it less accessible for individual researchers.
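To make the execution loop concrete, here is a minimal sketch of the reset → act → evaluate cycle such a harness implies. The env/agent interface, method names, and StepResult shape are illustrative assumptions, not OSWorld's actual API.

```python
# Minimal sketch of an execution-based evaluation loop. The env/agent
# interface and method names are hypothetical, not OSWorld's actual API.
from dataclasses import dataclass

@dataclass
class StepResult:
    screenshot: bytes   # raw PNG of the current screen
    done: bool          # agent signalled task completion

def run_task(env, agent, task_config, max_steps=15):
    """Reset the VM to the task's initial state, let the agent act on
    screenshots, then score the final OS state with the task's evaluator."""
    obs = env.reset(task_config)             # restore snapshot, apply setup
    for _ in range(max_steps):
        action = agent.predict(task_config["instruction"], obs.screenshot)
        obs = env.step(action)               # execute mouse/keyboard action
        if obs.done:
            break
    return env.evaluate()                    # task-specific script -> 0.0 or 1.0
```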
multi-os task distribution and evaluation
Medium confidence
Distributes 369 benchmark tasks across three operating systems (Ubuntu, Windows, macOS) with OS-specific initial state configurations and evaluation scripts. Each task includes a detailed setup configuration that establishes the OS environment, file structures, and application states before agent execution, enabling reproducible evaluation of agent performance across platform-specific UI paradigms and application ecosystems.
Includes OS-specific initial state setup configurations and custom evaluation scripts per task, rather than a single generic task definition. This approach captures OS-level differences in file systems, UI paradigms, and application ecosystems, but requires maintaining three parallel task implementations and evaluation harnesses.
More comprehensive than single-OS benchmarks because it tests cross-platform generalization, but significantly increases benchmark maintenance burden and infrastructure requirements compared to OS-agnostic synthetic benchmarks.
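A rough illustration of what such an OS-aware task definition could look like, written as a Python dict. The field names and setup-step types are assumptions for the sketch, not the benchmark's exact schema.

```python
# Illustrative shape of a per-task configuration with OS-specific initial
# state; field names and step types are assumptions, not the exact schema.
task_config = {
    "id": "rename-report-001",
    "instruction": "Rename report.odt on the Desktop to report_final.odt",
    # Setup steps executed before the agent starts, per target OS.
    "setup": {
        "ubuntu": [
            {"type": "copy_file",
             "parameters": {"src": "assets/report.odt",
                            "dest": "/home/user/Desktop/report.odt"}},
        ],
        "windows": [
            {"type": "copy_file",
             "parameters": {"src": "assets/report.odt",
                            "dest": "C:\\Users\\user\\Desktop\\report.odt"}},
        ],
    },
    # Task-specific success check run against the post-execution OS state.
    "evaluator": {"func": "file_exists",
                  "path": "~/Desktop/report_final.odt"},
}
```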
gui grounding and visual understanding evaluation
Medium confidence
Evaluates agent capability to understand and interact with graphical user interfaces by analyzing screenshots and identifying UI elements, buttons, menus, and text fields. Tests agent ability to visually ground task instructions in the actual UI state, a capability identified as a key limitation in current agents.
Explicitly evaluates GUI grounding and visual understanding as a core agent capability, identifying it as a key limitation in current agents. This focuses evaluation on a specific bottleneck rather than treating visual understanding as a solved problem.
More targeted than generic multimodal benchmarks because it focuses on GUI understanding as a specific capability, but may not capture other important agent limitations like operational knowledge or task planning.
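As one way to picture the screenshot-in, action-out contract this capability tests, here is a hedged sketch of a single grounding step. The vlm_client interface, prompt wording, and pyautogui-style action format are assumptions for illustration.

```python
# Sketch of one GUI-grounding step: the agent must map the instruction
# plus a raw screenshot to a concrete, executable action. The vlm_client
# API and the pyautogui-style action format are illustrative assumptions.
import base64

def predict_action(vlm_client, instruction: str, screenshot_png: bytes) -> str:
    """Ask a vision-language model to ground the instruction in the current
    screen and return a single executable action string."""
    image_b64 = base64.b64encode(screenshot_png).decode()
    response = vlm_client.complete(      # hypothetical VLM client call
        prompt=(f"Task: {instruction}\n"
                "Return exactly one pyautogui command that makes progress, "
                "e.g. pyautogui.click(x=412, y=88)."),
        image=image_b64,
    )
    return response.text.strip()         # executed inside the sandboxed VM
```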
operational knowledge and application expertise evaluation
Medium confidence
Evaluates agent capability to understand how to use applications and perform operations within them, testing knowledge of application-specific workflows, menu structures, keyboard shortcuts, and domain-specific operations. Identified as a key limitation in current agents alongside GUI grounding.
Explicitly evaluates operational knowledge and application expertise as a core agent capability, identifying it as a key limitation in current agents. This tests agent capability to understand how to use applications, not just how to interact with GUIs.
More comprehensive than GUI-only benchmarks because it tests both visual understanding and operational knowledge, but harder to diagnose which capability is limiting agent performance.
custom execution-based task evaluation
Medium confidence
Implements task-specific evaluation scripts that execute agent actions against real OS state and verify completion by checking file system changes, application state modifications, and other observable outcomes. Each of the 369 tasks includes a custom evaluation script that defines success criteria, captures execution traces, and produces reproducible verdicts independent of agent architecture or implementation details.
Uses custom per-task evaluation scripts rather than generic scoring functions, enabling task-specific success criteria that capture domain knowledge (e.g., correct file format, application-specific state changes). This approach is more accurate than generic metrics but requires significant engineering effort and domain expertise per task.
More accurate than generic scoring functions for complex, multi-step tasks, but less scalable and harder to maintain than standardized evaluation metrics used in simpler benchmarks.
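To show what a task-specific, execution-based success check might look like, here is a hypothetical evaluator for an "export spreadsheet as .xlsx" task. The vm helper methods are assumptions; the key property, scoring observable OS state rather than the agent's action trace, matches the description above.

```python
# Hypothetical per-task evaluator: success is decided from observable OS
# state, never from the agent's action trace. The vm.file_exists /
# vm.download helpers are illustrative assumptions.
import zipfile

def evaluate_export_task(vm) -> float:
    """Check that the agent produced a structurally valid .xlsx file;
    return 1.0 on success, 0.0 otherwise."""
    path = "/home/user/Desktop/budget.xlsx"
    if not vm.file_exists(path):             # OS-level state check
        return 0.0
    local_copy = vm.download(path)           # pull the file out of the VM
    try:
        with zipfile.ZipFile(local_copy) as z:   # .xlsx is a zip archive
            return 1.0 if "xl/workbook.xml" in z.namelist() else 0.0
    except zipfile.BadZipFile:
        return 0.0
```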
real-world task scenario grounding
Medium confidence
Grounds benchmark tasks in real-world computer use cases derived from actual user workflows, file management operations, application usage patterns, and multi-app interactions. Tasks are not synthetic or artificially constructed but represent genuine computer tasks that users perform, including file organization, document editing, web browsing, email management, and cross-application data workflows.
Tasks are derived from real-world computer use cases rather than synthetic or artificially constructed scenarios, aiming to evaluate agent capability on tasks that users actually perform. This grounds evaluation in practical utility but introduces data contamination risks and makes it harder to control task difficulty and distribution.
More practically relevant than synthetic benchmarks (e.g., WebShop, MiniWoB) because tasks represent actual user workflows, but less controlled and harder to validate than carefully constructed synthetic tasks with known difficulty and no training data overlap.
multimodal agent performance benchmarking
Medium confidence
Provides standardized evaluation infrastructure for measuring multimodal agent performance (combining vision and language understanding) on computer task completion. Establishes baseline human performance (72.36% success rate) and current state-of-the-art model performance (12.24% success rate), quantifying the gap between human and AI agent capability on real OS tasks.
Establishes quantified baseline performance (human 72.36% vs SOTA 12.24%) on real OS tasks, creating a measurable target for agent improvement. The large gap indicates substantial room for progress and highlights specific capability gaps (GUI grounding, operational knowledge) that agents need to address.
More realistic performance measurement than synthetic benchmarks because it uses real OS environments and real-world tasks, but the 60+ percentage point gap between human and SOTA performance suggests the benchmark may be too difficult to provide useful signal for incremental improvements.
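For reference, the headline figures compose as the mean of per-task binary verdicts over the usable task set (illustrative arithmetic, not OSWorld's reporting code):

```python
# Success rate as the mean of per-task binary verdicts (illustrative).
def success_rate(verdicts: list[float]) -> float:
    return 100.0 * sum(verdicts) / len(verdicts)

# With 361 usable tasks, one extra solved task moves the score by ~0.28
# points, so the human (72.36%) vs. SOTA (12.24%) gap is roughly 217
# tasks wide.
```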
interactive benchmark data viewer
Medium confidence
Provides a web-based interactive viewer for exploring benchmark tasks, initial states, expected outcomes, and evaluation results. Enables researchers and developers to inspect individual tasks, understand evaluation criteria, and analyze agent performance without requiring local execution of the full benchmark infrastructure.
Provides interactive web-based exploration of benchmark tasks and results rather than requiring local data access or command-line tools. Lowers barrier to entry for researchers who want to understand benchmark tasks without setting up evaluation infrastructure.
More accessible than command-line or programmatic data access, but potentially less powerful for bulk analysis or custom queries compared to direct data access.
aws-accelerated benchmark evaluation
Medium confidence
Integrates with AWS infrastructure to accelerate benchmark evaluation, reducing full benchmark execution time to approximately 1 hour (as of the 2025-07-28 update). Leverages cloud VM provisioning and parallel task execution to speed up evaluation compared to local execution, enabling faster iteration and result collection.
Integrates AWS cloud infrastructure to parallelize benchmark evaluation and reduce execution time to ~1 hour, rather than requiring local VM execution. This is a recent improvement (2025-07-28) that suggests previous evaluation was significantly slower.
Faster than local evaluation for teams with AWS access, but adds cloud provider dependency and cost compared to fully local benchmarking.
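A minimal sketch of how cloud-parallelized evaluation of this shape could work, reusing the hypothetical run_task from the earlier sketch; provision_vm and the worker count are likewise assumptions, not the actual AWS integration.

```python
# Sketch of fanning tasks out across cloud VMs; provision_vm and run_task
# are hypothetical helpers, not the actual AWS integration.
from concurrent.futures import ThreadPoolExecutor

def evaluate_benchmark(tasks, agent, num_workers=32):
    """Run tasks in parallel on fresh VMs and average the verdicts.
    Wall-clock time scales roughly with len(tasks) / num_workers."""
    def run_one(task_config):
        vm = provision_vm(os_type=task_config.get("os", "ubuntu"))
        try:
            return run_task(vm, agent, task_config)   # 0.0 or 1.0 verdict
        finally:
            vm.terminate()                            # avoid idle-VM cost

    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        verdicts = list(pool.map(run_one, tasks))
    return sum(verdicts) / len(verdicts)
```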
benchmark versioning and continuous improvement
Medium confidence
Maintains versioned benchmark releases with documented improvements and bug fixes. The 2025-07-28 update introduced 'OSWorld-Verified' with comprehensive improvements including community-reported example fixes and AWS acceleration, indicating active maintenance and responsiveness to feedback.
Actively maintains and improves benchmark with documented versions and community-driven bug fixes, rather than releasing a static benchmark. The 2025-07-28 'OSWorld-Verified' update indicates responsiveness to community feedback and ongoing refinement.
More maintainable and trustworthy than static benchmarks because improvements are tracked and documented, but requires users to specify version for reproducibility and may introduce incompatibilities between versions.
open-source benchmark infrastructure
Medium confidence
Provides open-source access to benchmark code, evaluation scripts, task data, and documentation, enabling independent verification, extension, and reproduction of benchmark results. All components (code, documentation, data, viewer) are publicly available, supporting transparency and community contribution.
Releases all benchmark components (code, data, documentation, viewer) as open-source rather than proprietary, enabling independent verification and community contributions. This transparency is unusual for benchmarks but increases trust and enables broader adoption.
More transparent and reproducible than proprietary benchmarks, but requires more effort to maintain open-source infrastructure and may expose implementation details that could be exploited by agents trained specifically for the benchmark.
multi-application workflow evaluation
Medium confidence
Evaluates agent capability on tasks requiring interaction across multiple applications and OS-level file I/O operations, not just single-application tasks. Tasks include workflows that span web browsers, desktop applications, file managers, and system utilities, testing agent ability to coordinate actions across application boundaries and manage cross-app data flow.
Includes tasks requiring coordination across multiple applications and OS-level file I/O, rather than focusing on single-application tasks. This tests agent capability on realistic workflows but significantly increases task complexity and evaluation difficulty.
More realistic than single-application benchmarks because it tests cross-app coordination, but significantly harder to evaluate and debug because failures can stem from issues in any of multiple applications or their interactions.
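To illustrate why cross-app failures are hard to localize, here is a hypothetical evaluator for a workflow spanning a browser download and a spreadsheet export; the vm helpers and file paths are assumptions. Note that a 0.0 verdict alone cannot say which application's step broke.

```python
# Hypothetical evaluator for a two-app workflow: download a page in the
# browser, then export selected data as CSV from a spreadsheet app. Both
# checks inspect final OS state; the vm helpers are assumptions.
import csv
import io

def evaluate_cross_app(vm) -> float:
    if not vm.file_exists("/home/user/Downloads/prices.html"):
        return 0.0                               # browser step never happened
    csv_bytes = vm.read_file("/home/user/Desktop/prices.csv")
    if csv_bytes is None:
        return 0.0                               # spreadsheet step missing
    rows = list(csv.reader(io.StringIO(csv_bytes.decode())))
    return 1.0 if rows and rows[0] == ["item", "price"] else 0.0
```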
Capabilities are decomposed by AI analysis. Each maps to specific user intents and improves with match feedback.
Related Artifacts (sharing capabilities)
Artifacts that share capabilities with OSWorld, ranked by overlap. Discovered automatically through the match graph.
ByteDance: UI-TARS 7B
UI-TARS-1.5 is a multimodal vision-language agent optimized for GUI-based environments, including desktop interfaces, web browsers, mobile systems, and games. Built by ByteDance, it extends the UI-TARS framework with reinforcement...
MobileAgent
Mobile-Agent: The Powerful GUI Agent Family
HTTPie AI
Revolutionizes API testing with AI, intuitive GUI, and cross-platform...
UI-TARS-desktop
The Open-Source Multimodal AI Agent Stack: Connecting Cutting-Edge AI Models and Agent Infra
Agent-S
Agent S: an open agentic framework that uses computers like a human
AIForge
🚀 An intent-adaptive intelligent execution engine: with a single sentence, let AI handle what you want done (data analysis and processing, time-sensitive content creation, retrieval of up-to-date information, data visualization, system interaction, automated workflows, code development, and more)
Best For
- ✓AI research teams developing multimodal agents and evaluating GUI understanding capabilities
- ✓Companies building autonomous desktop automation tools and needing realistic performance baselines
- ✓Researchers studying human-computer interaction and agent behavior in real OS environments
- ✓Teams building cross-platform automation tools who need OS-agnostic agent evaluation
- ✓Researchers studying how agent architecture and training data affect OS-specific performance
- ✓Organizations deploying agents in heterogeneous enterprise environments with mixed OS deployments
- ✓Teams developing vision-language models for GUI understanding
- ✓Researchers studying visual grounding in multimodal agents
Known Limitations
- ⚠Evaluation requires actual OS execution in sandboxed VMs, making local evaluation computationally expensive and time-consuming (reduced to ~1 hour with AWS support as of 2025-07-28, but previously significantly longer)
- ⚠8 of 369 tasks excluded from usable benchmark due to network dependencies requiring manual configuration, reducing effective test set to 361 tasks
- ⚠No specification of train/dev/test split or data contamination analysis — tasks derived from real-world use cases may overlap with web-scraped LLM training data
- ⚠Scoring methodology not fully detailed in documentation — unclear whether success is binary, graduated, or includes partial credit; timeout thresholds not specified
- ⚠No failure mode analysis provided — unclear which task categories agents struggle with most (by OS, application type, or complexity)
- ⚠Task distribution across Ubuntu, Windows, and macOS not specified in documentation — unclear if tasks are balanced or skewed toward one OS
UnfragileRank
UnfragileRank is computed from adoption signals, documentation quality, ecosystem connectivity, match graph feedback, and freshness. No artifact can pay for a higher rank.
About
Benchmark for evaluating multimodal agents on real computer tasks across Ubuntu, Windows, and macOS using actual operating systems, testing file management, application use, and multi-app workflows with screenshot understanding.