What can @github/computer-use-mcp do?

gui automation via standardized mcp protocol, screenshot capture with llm-compatible encoding, mouse control with absolute positioning, keyboard input with text and special key support, mcp server lifecycle and tool registration, agent-driven perception-action loop orchestration

@github/computer-use-mcp

MCP Server · Free

Computer Use MCP Server

Open Source

Capabilities (6 decomposed)

gui automation via standardized mcp protocol

Medium confidence

Exposes screen interaction (screenshot capture, mouse, keyboard) through the Model Context Protocol (MCP), so LLM agents can control desktop applications and web interfaces programmatically. Implements the MCP server specification with tools for screenshot capture, mouse movement and clicking, and keyboard input, allowing any MCP-compatible client (Claude, custom agents) to orchestrate GUI interactions without writing its own OS-level bindings.

Solves for

I want my LLM agent to interact with desktop applications that don't have APIs

I need to automate repetitive GUI workflows like form filling or data entry across multiple applications

I want to give Claude or another LLM the ability to see and control my screen to complete tasks

Best for

AI agent developers building autonomous task automation systems

Teams integrating Claude with legacy or proprietary desktop software

Researchers prototyping LLM-driven UI automation without building custom integrations

Requires

Node.js 16+ (MCP server runtime)

MCP-compatible client (Claude API with MCP support, or custom agent framework)

Display server access (X11/Wayland on Linux, native on macOS/Windows)

Limitations

No built-in OCR — relies on LLM's vision capabilities to interpret screen content, limiting accuracy on complex layouts

Latency overhead from screenshot encoding/transmission per action cycle (typically 500ms-2s round-trip)

No native support for multi-monitor setups or window-specific targeting — operates on full screen coordinates only

What makes it unique

GitHub's implementation standardizes computer use as an MCP tool, enabling any MCP-compatible LLM client to control GUIs without custom integrations. Uses MCP's resource and tool abstractions to expose OS-level input/output as composable capabilities, rather than building a proprietary agent framework.

vs alternatives

Leverages MCP's standardization to work with any MCP client (Claude, custom agents) without vendor lock-in, whereas Anthropic's native computer-use API is Claude-specific and requires direct API integration

screenshot capture with llm-compatible encoding

Medium confidence

Captures the current display state and encodes it as base64-encoded image data (PNG/JPEG) compatible with multimodal LLM vision APIs. Implements efficient screenshot serialization that balances image quality with token efficiency, allowing LLMs to analyze screen content for decision-making in automation loops.
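As an illustration of the encoding step described above, here is a minimal sketch that wraps raw image bytes in an MCP-style image content block (`type`, `data`, `mimeType`). The listing does not show this package's actual tool output, so the helper names `to_image_content` and `from_image_content` and the stand-in bytes are assumptions made for the example.

```python
import base64

PNG_MAGIC = b"\x89PNG\r\n\x1a\n"  # the 8-byte signature at the start of every PNG file


def to_image_content(image_bytes: bytes, mime_type: str = "image/png") -> dict:
    """Wrap raw image bytes in an MCP-style image content block (hypothetical helper)."""
    return {
        "type": "image",
        "data": base64.b64encode(image_bytes).decode("ascii"),
        "mimeType": mime_type,
    }


def from_image_content(block: dict) -> bytes:
    """Decode the base64 payload back to raw bytes, as a client would before display."""
    return base64.b64decode(block["data"])


# Stand-in for a real capture: the PNG signature plus filler bytes, not a valid image.
fake_screenshot = PNG_MAGIC + b"\x00" * 16
block = to_image_content(fake_screenshot)
```

Because the payload is plain ASCII inside the tool response, the client needs no file I/O or image hosting to pass it on to a vision model.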

Solves for

I need my LLM agent to see what's currently on screen to decide what action to take next

I want to capture screen state at specific points in a workflow for debugging or logging

I need to feed visual context to a vision model to interpret UI elements or extract information

Best for

LLM agent developers building perception-action loops

Automation engineers debugging GUI interaction failures

Researchers studying LLM reasoning over visual UI state

Requires

Node.js 16+

Display server with screenshot capability (X11, Wayland, native macOS/Windows)

MCP client with image/base64 support in tool responses

Limitations

No selective region capture — always captures full screen, increasing token usage for large displays

Encoding overhead adds 100-300ms per screenshot depending on resolution and compression

No built-in image optimization or downsampling — relies on client-side compression or LLM token budgeting
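The cost of sending full-screen captures without downsampling can be estimated with a little arithmetic: padded base64 turns every 3 raw bytes into 4 characters. The sketch below computes the encoded size for a hypothetical 1.5 MB capture; the file size is an illustrative assumption, not a measurement of this server.

```python
import base64
import math


def base64_length(raw_bytes: int) -> int:
    """Number of characters in padded base64 for a payload of raw_bytes bytes."""
    return 4 * math.ceil(raw_bytes / 3)


# Hypothetical size of a full-screen PNG capture.
raw = 1_500_000
encoded = base64_length(raw)   # characters sent in the tool response
overhead = encoded / raw       # about 1.33x the raw size

# Cross-check the formula against the standard library encoder.
sample = b"x" * 1000
matches_stdlib = len(base64.b64encode(sample)) == base64_length(len(sample))
```

This only covers transport size. How many tokens a vision model charges for an image depends on the model and usually on pixel dimensions, so downscaling on the client side affects both.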

What makes it unique

Encodes screenshots as base64 within MCP tool responses, making them directly consumable by multimodal LLMs without separate file I/O or external image hosting. Integrates screenshot capture as a first-class MCP tool rather than a side-channel.

vs alternatives

Simpler integration than Anthropic's computer-use API because it uses standard MCP tool responses; no special image handling protocol needed, just base64 encoding in tool output

mouse control with absolute positioning

Medium confidence

Enables LLM agents to move the mouse cursor to absolute screen coordinates and perform click actions (left, right, double-click). Implements coordinate-based input without relative motion or gesture support, requiring the agent to calculate target positions based on visual feedback from screenshots.
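The coordinate calculation left to the agent can be sketched as follows, assuming the client downscaled the screenshot before showing it to the model (the listing only says downsampling is left to the client). The function name `to_screen_coords` and the sizes are hypothetical.

```python
def to_screen_coords(x_img, y_img, img_size, screen_size):
    """Map a point picked on a (possibly downscaled) screenshot to absolute screen pixels."""
    img_w, img_h = img_size
    scr_w, scr_h = screen_size
    x = round(x_img * scr_w / img_w)
    y = round(y_img * scr_h / img_h)
    # Clamp to the visible area so an off-by-one from the model never leaves the screen.
    x = min(max(x, 0), scr_w - 1)
    y = min(max(y, 0), scr_h - 1)
    return x, y


# The model saw a 1280x720 image of a 2560x1440 display.
center = to_screen_coords(640, 360, (1280, 720), (2560, 1440))
corner = to_screen_coords(1280, 720, (1280, 720), (2560, 1440))
```

Clamping on the client side is one way to compensate for a server that accepts any coordinate without validation.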

Solves for

I want my agent to click on UI elements it identified in a screenshot

I need to move the mouse to a specific location before typing or performing another action

I want to simulate right-click context menus for accessing application features

Best for

Automation developers building click-based workflows on web and desktop UIs

Teams automating data entry or form submission across applications

Researchers testing LLM spatial reasoning on screen coordinates

Requires

Node.js 16+

OS input simulation permissions (may require elevated privileges on some systems)

MCP client capable of sending tool calls with numeric coordinate parameters

Limitations

Absolute positioning only — no relative motion, drag-and-drop, or gesture support (swipes, pinches)

No collision detection or validation — agent can click on invalid coordinates without feedback

Coordinate system is global screen space; no window-relative or element-relative targeting

What makes it unique

Exposes mouse control as discrete MCP tools (move, click) with absolute coordinate parameters, allowing agents to compose clicks with screenshot analysis in a tight perception-action loop. No gesture or drag abstractions — forces explicit coordinate calculation.

vs alternatives

More granular than high-level UI automation frameworks (Selenium, Playwright) because it operates at raw input level; more flexible for non-web UIs but requires agent to handle coordinate math

keyboard input with text and special key support

Medium confidence

Allows LLM agents to send keyboard input including text strings and special keys (Enter, Tab, Escape, arrow keys, etc.) to the focused application. Implements key event simulation at the OS level, enabling agents to type into forms, navigate menus, and trigger keyboard shortcuts without requiring visual feedback between keystrokes.

Solves for

I want my agent to type text into a form field or search box

I need to send keyboard shortcuts (Ctrl+C, Cmd+V) to interact with applications

I want to navigate UI menus using arrow keys and Enter

Best for

Automation engineers building text-input workflows (form filling, search, code entry)

Teams automating keyboard-driven applications (terminals, IDEs, legacy software)

Researchers studying LLM text generation for UI interaction

Requires

Node.js 16+

OS input simulation permissions

MCP client capable of sending tool calls with string and key parameters

Limitations

No keyboard state awareness — agent cannot detect if Caps Lock or Num Lock is active

No input validation or error recovery — typing into wrong field fails silently without feedback

Special key support depends on OS and keyboard layout; non-ASCII characters may not work reliably

What makes it unique

Integrates keyboard input as MCP tools with support for both text strings and named special keys, allowing agents to compose typing actions with screenshot analysis. Handles modifier keys as part of key names rather than separate state.
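To make "modifier keys as part of key names" concrete, here is a small sketch that splits a combined key name such as `Ctrl+Shift+T` into modifiers and a final key. The `+`-separated format and the `parse_key_combo` helper are assumptions for illustration; the listing does not document the server's exact key syntax.

```python
MODIFIERS = {"ctrl", "shift", "alt", "cmd"}


def parse_key_combo(combo: str):
    """Split 'Ctrl+Shift+T' into (['Ctrl', 'Shift'], 'T'). Hypothetical key syntax."""
    parts = [p.strip() for p in combo.split("+") if p.strip()]
    if not parts:
        raise ValueError("empty key combo")
    *mods, key = parts
    unknown = [m for m in mods if m.lower() not in MODIFIERS]
    if unknown:
        raise ValueError(f"unknown modifier(s): {unknown}")
    # Normalize modifier casing; leave the final key exactly as given.
    return [m.capitalize() for m in mods], key


shortcut = parse_key_combo("Ctrl+Shift+T")
plain = parse_key_combo("Enter")
```

Rejecting unknown modifiers before sending the call gives the agent an explicit error instead of a keystroke that silently does something else.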

vs alternatives

More flexible than web automation tools (Selenium) for non-web applications; simpler than low-level keyboard event APIs because it abstracts key name resolution and modifier handling

mcp server lifecycle and tool registration

Medium confidence

Implements the MCP server specification, registering screenshot, mouse, and keyboard tools as discoverable capabilities that MCP clients can invoke. Handles MCP protocol handshake, tool schema definition, and request/response serialization, enabling any MCP-compatible client to discover and call computer-use tools without hardcoding tool names.
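The handshake and discovery flow described above can be sketched as the JSON-RPC 2.0 messages a client would send: `initialize`, the `notifications/initialized` notification, `tools/list`, then `tools/call`. The tool name `screenshot`, the client info, and the protocol version string are placeholder assumptions; a real client would use the names returned by `tools/list`.

```python
import itertools
import json

_ids = itertools.count(1)


def request(method, params=None):
    """Build a JSON-RPC 2.0 request; requests carry an id so responses can be matched."""
    msg = {"jsonrpc": "2.0", "id": next(_ids), "method": method}
    if params is not None:
        msg["params"] = params
    return msg


def notification(method):
    """Build a JSON-RPC 2.0 notification; no id, so no response is expected."""
    return {"jsonrpc": "2.0", "method": method}


session = [
    request("initialize", {
        "protocolVersion": "2024-11-05",  # assumed; negotiate with the real server
        "capabilities": {},
        "clientInfo": {"name": "example-client", "version": "0.0.1"},
    }),
    notification("notifications/initialized"),
    request("tools/list"),
    # Hypothetical tool name; use whatever tools/list actually reports.
    request("tools/call", {"name": "screenshot", "arguments": {}}),
]

for msg in session:
    print(json.dumps(msg))
```

Over the stdio transport each message is written as one line of JSON, which is why the sketch prints them one per line.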

Solves for

I want to integrate computer-use capabilities into my MCP-compatible agent framework

I need my LLM client to discover available GUI automation tools dynamically

I want to build a custom agent that uses MCP tools for computer control

Best for

MCP client developers (Claude API, custom agent frameworks) integrating computer use

Teams building standardized agent platforms that support multiple tool providers

Researchers prototyping LLM agent architectures with pluggable tool systems

Requires

Node.js 16+

MCP client library or framework (e.g., @modelcontextprotocol/sdk)

Understanding of MCP protocol and tool schema format

Limitations

MCP protocol overhead adds latency to each tool invocation (typically 50-200ms per round-trip)

Tool schema must be statically defined at server startup — no dynamic tool registration based on runtime state

No built-in authentication or access control — assumes trusted client environment

What makes it unique

Implements MCP server specification for computer use, making GUI automation tools discoverable and composable within any MCP ecosystem. Uses MCP's tool schema system to define screenshot, mouse, and keyboard as standardized, versioned capabilities.
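For a sense of what such a tool schema looks like, the sketch below defines a hypothetical `click` tool in the shape MCP uses for tool definitions (`name`, `description`, `inputSchema` as JSON Schema) and checks arguments against it. The tool name and fields are assumptions, and `check_arguments` is a tiny hand-rolled check, not a full JSON Schema validator.

```python
# Hypothetical tool definition; the real server's tool names and fields may differ.
CLICK_TOOL = {
    "name": "click",
    "description": "Click at absolute screen coordinates.",
    "inputSchema": {
        "type": "object",
        "properties": {
            "x": {"type": "integer", "minimum": 0},
            "y": {"type": "integer", "minimum": 0},
            "button": {"type": "string", "enum": ["left", "right", "double"]},
        },
        "required": ["x", "y"],
    },
}


def check_arguments(tool: dict, args: dict) -> list:
    """Return a list of problems found in args; an empty list means they look valid."""
    schema = tool["inputSchema"]
    props = schema["properties"]
    problems = [f"missing '{n}'" for n in schema.get("required", []) if n not in args]
    for name, value in args.items():
        spec = props.get(name)
        if spec is None:
            problems.append(f"unexpected '{name}'")
        elif spec["type"] == "integer":
            # bool is a subclass of int in Python, so exclude it explicitly.
            if not isinstance(value, int) or isinstance(value, bool):
                problems.append(f"'{name}' must be an integer")
            elif value < spec.get("minimum", value):
                problems.append(f"'{name}' is below {spec['minimum']}")
        elif spec["type"] == "string":
            if not isinstance(value, str) or value not in spec.get("enum", [value]):
                problems.append(f"'{name}' is not an allowed value")
    return problems


ok = check_arguments(CLICK_TOOL, {"x": 512, "y": 384, "button": "left"})
bad = check_arguments(CLICK_TOOL, {"x": 512})
```

Validating on the client before calling the tool is optional, but it turns a malformed call into a local error rather than a round-trip to the server.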

vs alternatives

Standardizes computer use as MCP tools rather than a proprietary API, enabling interoperability across different LLM clients and agent frameworks; more flexible than Anthropic's native computer-use API which is Claude-specific

agent-driven perception-action loop orchestration

Medium confidence

Enables LLM agents to execute multi-step automation workflows by composing screenshot analysis with mouse/keyboard actions in tight feedback loops. The agent perceives screen state via screenshots, reasons about next actions, and executes them via mouse/keyboard tools, repeating until task completion. Supports iterative refinement where agents can correct mistakes by taking new screenshots and adjusting subsequent actions.
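The perceive, reason, act cycle can be sketched as a plain loop. Everything below is a stand-in: `take_screenshot`, `decide`, and `act` are hypothetical callables (a real client would call the MCP tools and an LLM), and the "screen" is a dictionary rather than an image so the example runs on its own.

```python
def run_loop(take_screenshot, decide, act, max_steps=20):
    """Repeat perceive -> decide -> act until decide() returns None or the budget runs out."""
    history = []
    for _ in range(max_steps):
        screen = take_screenshot()        # perceive
        action = decide(screen, history)  # reason (an LLM call in a real agent)
        if action is None:                # the agent judges the task complete
            return history
        act(action)                       # act via mouse/keyboard tools
        history.append(action)
    raise RuntimeError(f"task not finished after {max_steps} steps")


# Toy environment: a single text field the agent fills one character at a time.
state = {"field": ""}
TARGET = "hi"


def take_screenshot():
    return dict(state)  # stand-in for an encoded image


def decide(screen, history):
    typed = screen["field"]
    if typed == TARGET:
        return None
    return {"tool": "type", "text": TARGET[len(typed)]}


def act(action):
    state["field"] += action["text"]


actions = run_loop(take_screenshot, decide, act)
```

Taking a fresh screenshot on every iteration is what lets the agent notice a failed action and correct it on the next step.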

Solves for

I want my agent to complete a multi-step task like filling a form, submitting it, and verifying the result

I need my agent to recover from mistakes by detecting failures in screenshots and retrying with different actions

I want to build a workflow that adapts to dynamic UI changes by re-analyzing the screen state

Best for

Automation engineers building resilient, adaptive workflows for complex applications

Teams automating business processes that require visual feedback and error recovery

Researchers studying LLM reasoning and planning in interactive environments

Requires

Node.js 16+

MCP-compatible LLM client with vision capabilities (Claude 3.5+)

Sufficient LLM context window to maintain task state across multiple action cycles

Limitations

Latency compounds with loop iterations — each screenshot + action cycle adds 500ms-2s, making long workflows slow

No built-in state machine or workflow definition — agent must maintain task context and progress in its reasoning

No timeout or loop-break mechanisms — agent can get stuck in infinite loops if it misinterprets screen state
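Since the limitations above leave loop control to the caller, a client can add its own guard. The sketch below is one possible approach, not part of the package: a hypothetical `LoopGuard` that stops on a step budget, a wall-clock budget, or several consecutive identical screenshots.

```python
import hashlib
import time
from typing import Optional


class LoopGuard:
    """Client-side stop conditions for an agent loop (hypothetical helper)."""

    def __init__(self, max_steps=30, max_seconds=120.0, max_repeats=3):
        self.max_steps = max_steps
        self.max_seconds = max_seconds
        self.max_repeats = max_repeats
        self.started = time.monotonic()
        self.steps = 0
        self.repeats = 0
        self.last_digest = None

    def check(self, screenshot_bytes: bytes) -> Optional[str]:
        """Return a reason to stop, or None if the loop may continue."""
        self.steps += 1
        if self.steps > self.max_steps:
            return "step budget exhausted"
        if time.monotonic() - self.started > self.max_seconds:
            return "time budget exhausted"
        digest = hashlib.sha256(screenshot_bytes).hexdigest()
        # Count how many times in a row the screen has been byte-identical.
        self.repeats = self.repeats + 1 if digest == self.last_digest else 0
        self.last_digest = digest
        if self.repeats >= self.max_repeats:
            return "screen unchanged"
        return None


guard = LoopGuard(max_steps=10, max_repeats=2)
results = [guard.check(b"same pixels") for _ in range(3)]
```

Byte-identical comparison is deliberately crude: a blinking cursor or a clock in the corner defeats it, so a real client might compare a cropped or downscaled image instead. With cycles of 500ms-2s each, a 30-step budget also bounds a run at roughly 15 to 60 seconds of tool latency.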

What makes it unique

Enables agents to orchestrate perception-action loops by composing MCP tools (screenshot, mouse, keyboard) without explicit workflow definition. Relies on LLM reasoning to maintain task context and decide when to stop, rather than using state machines or explicit loop control.

vs alternatives

More flexible than RPA tools (UiPath, Blue Prism) because it uses LLM reasoning for adaptation; simpler than building custom agent frameworks because it leverages MCP's tool abstraction

Capabilities are decomposed by AI analysis. Each maps to specific user intents and improves with match feedback.

Related Artifacts sharing capabilities

Artifacts that share capabilities with @github/computer-use-mcp, ranked by overlap. Discovered automatically through the match graph.

MCP Server · 23

@atomicbotai/computer-use-mcp

MCP server exposing desktop computer-use as an MCP tool

mouse-control-with-coordinate-targeting, desktop-automation-via-mcp-protocol, cross-platform-input-abstraction

3 shared capabilities

CLI Tool · 42

Open Interpreter

Natural language computer interface: runs local code to accomplish tasks, like a local Code Interpreter.

mouse and keyboard control with coordinate-based interaction, computer vision and screenshot capture for ui automation

2 shared capabilities

MCP Server · 27

just-every/mcp-screenshot-website-fast

High-quality screenshot capture optimized for Claude Vision API. Automatically tiles full pages into 1072x1072 chunks (1.15 megapixels) with configurable viewports and wait strategies for dynamic content.

mcp protocol integration with stdio json-rpc transport, cli binary interface with direct command-line screenshot execution

2 shared capabilities

MCP Server · 21

gmod-mcp

MCP tool for Garry's Mod: RCON, Lua execution, window screenshot/control, and SFTP file management

game-window-interaction-and-control, game-window-screenshot-capture

2 shared capabilities

MCP Server · 27

@hisma/server-puppeteer

Fork and update (v0.6.5) of the original @modelcontextprotocol/server-puppeteer MCP server for browser automation using Puppeteer.

page-screenshot-and-visual-capture, headless-browser-automation-via-mcp

2 shared capabilities

MCP Server · 44

UI-TARS-desktop

The Open-Source Multimodal AI Agent Stack: Connecting Cutting-Edge AI Models and Agent Infra

gui-automation-via-screenshot-vlm-action-loop

1 shared capability

Best For

✓ AI agent developers building autonomous task automation systems
✓ Teams integrating Claude with legacy or proprietary desktop software
✓ Researchers prototyping LLM-driven UI automation without building custom integrations
✓ LLM agent developers building perception-action loops
✓ Automation engineers debugging GUI interaction failures
✓ Researchers studying LLM reasoning over visual UI state
✓ Automation developers building click-based workflows on web and desktop UIs
✓ Teams automating data entry or form submission across applications

Known Limitations

⚠ No built-in OCR: relies on the LLM's vision capabilities to interpret screen content, limiting accuracy on complex layouts
⚠ Latency overhead from screenshot encoding/transmission per action cycle (typically 500ms-2s round-trip)
⚠ No native support for multi-monitor setups or window-specific targeting: operates on full screen coordinates only
⚠ Requires MCP client implementation; not directly usable as a standalone tool without wrapping in an agent framework
⚠ No selective region capture: always captures full screen, increasing token usage for large displays
⚠ Encoding overhead adds 100-300ms per screenshot depending on resolution and compression

Requirements

Node.js 16+ (MCP server runtime)

MCP-compatible client (Claude API with MCP support, or custom agent framework)

Display server access (X11/Wayland on Linux, native on macOS/Windows)

Appropriate OS permissions for input simulation and screenshot capture

Display server with screenshot capability (X11, Wayland, native macOS/Windows)

MCP client with image/base64 support in tool responses

OS input simulation permissions (may require elevated privileges on some systems)

Input / Output

Accepts: screenshot requests (no parameters); mouse coordinates x and y (integers, 0 to screen width and 0 to screen height); click type (left, right, double); text strings (arbitrary length); special key names (Enter, Tab, Escape, ArrowUp, ArrowDown, ArrowLeft, ArrowRight, etc.); modifier combinations (Ctrl+, Shift+, Alt+, Cmd+); MCP protocol messages (tools/call, resources/read, etc.); task description (natural language); initial screen state (screenshot)

Produces: base64-encoded PNG or JPEG screenshot data; image metadata (dimensions, format); structured metadata about screen state; confirmation messages for input actions (e.g., 'clicked at 512, 384' or 'typed "hello world"'); error messages if coordinates are out of bounds or a key is not recognized; MCP tool definitions (schema, description, parameters); MCP tool responses (result, error); task completion status; sequence of actions taken; final screen state

UnfragileRank

Adoption: 15% (30% weight)

Quality: 14% (25% weight)

Ecosystem: 30% (25% weight)

Match Graph: 10% (15% weight)

Freshness: 75% (5% weight)

UnfragileRank is computed from adoption signals, documentation quality, ecosystem connectivity, match graph feedback, and freshness. No artifact can pay for a higher rank.
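The listing does not publish the exact formula, so the following is only a sketch under the assumption that the rank is a plain weighted sum of the five signals shown above, scaled to 100. The published score for this artifact is not visible in the page, so the result cannot be checked against it.

```python
# (signal value, weight) pairs taken from the breakdown above.
SIGNALS = {
    "adoption": (0.15, 0.30),
    "quality": (0.14, 0.25),
    "ecosystem": (0.30, 0.25),
    "match_graph": (0.10, 0.15),
    "freshness": (0.75, 0.05),
}

total_weight = sum(weight for _, weight in SIGNALS.values())

# Assumed aggregation: weighted sum of signal values, scaled to a 0-100 range.
score = 100 * sum(value * weight for value, weight in SIGNALS.values())
```

Under that assumption the weights sum to 1 and the score works out to about 20.75 out of 100; a different aggregation (for example with caps or nonlinear scaling) would give a different number.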

Type: MCP Server

6 capabilities


Package Details

Registry: npm

Version: 0.1.22

Weekly Downloads

About

Computer Use MCP Server

Alternatives to @github/computer-use-mcp

IntelliCode · 50 · Extension

AI-assisted development

GitHub Copilot Chat · 53 · Extension

AI chat features powered by Copilot

GitHub Copilot · 52 · Extension

Your AI pair programmer

Claude Code for VS Code · 52 · Extension

Claude Code for VS Code: Harness the power of Claude Code without leaving your IDE

Data Sources

mcp registry
