OSS Agent I built topped the TerminalBench on Gemini-3-flash-preview
Framework-free. Scored 65.2% vs Google's official 47.8%, and the previous top closed-source entry, Junie CLI, at 64.3%. Since there have been a number of reports of deliberate cheating on TerminalBench 2.0 lately (https://debugml.github.io/cheating-agents/), I would also like to clarify a few things.
Capabilities (7 decomposed)
Terminal-command execution with LLM reasoning
Medium confidence: Executes shell commands in a sandboxed terminal environment while maintaining bidirectional context with an LLM agent. The agent receives command output, error streams, and exit codes in real time, enabling it to reason about execution results and decide on next steps. Implements a command-response loop where the LLM can chain multiple commands based on previous outputs, with built-in handling for interactive prompts and long-running processes.
Implements a tight feedback loop between LLM reasoning and terminal execution with real-time output streaming, allowing agents to make decisions based on partial command results rather than waiting for full completion. Uses structured command schemas to constrain agent actions while preserving flexibility.
Outperforms alternatives on TerminalBench because it combines low-latency command execution with efficient context management, avoiding the overhead of cloud-based execution APIs while maintaining safety through schema-based action validation.
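The agent's source isn't shown on this page, but the command-response loop described above can be sketched roughly as follows. The `propose_command` callback stands in for the LLM and its interface is an assumption, not the listing's actual API:

```python
import subprocess

def agent_loop(propose_command, task, max_steps=20, timeout=60):
    """Minimal command-response loop: the LLM proposes a shell command,
    we execute it, and feed stdout/stderr/exit code back as context.

    propose_command(task, history) -> next shell command, or None when done.
    (Hypothetical callback; the real agent's interface is not published.)
    """
    history = []
    for _ in range(max_steps):
        cmd = propose_command(task, history)
        if cmd is None:
            break  # the model has decided the task is complete
        try:
            proc = subprocess.run(
                cmd, shell=True, capture_output=True, text=True, timeout=timeout
            )
            result = {"cmd": cmd, "exit": proc.returncode,
                      "stdout": proc.stdout, "stderr": proc.stderr}
        except subprocess.TimeoutExpired:
            result = {"cmd": cmd, "exit": None, "stdout": "", "stderr": "timeout"}
        history.append(result)  # fed back to the LLM on the next iteration
    return history
```

A real implementation would stream output incrementally rather than waiting for `subprocess.run` to return; this sketch shows only the feedback-loop shape.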
Multi-step task decomposition and planning
Medium confidence: Breaks down complex terminal-based tasks into executable subtasks using chain-of-thought reasoning. The agent generates a plan, executes steps sequentially, and dynamically adjusts the plan based on intermediate results. Implements backtracking logic where failed steps trigger re-planning with updated context about what went wrong.
Uses dynamic re-planning triggered by execution failures rather than static pre-planning, allowing the agent to adapt strategies mid-execution. Maintains a reasoning trace that captures why plans changed, enabling better learning from failures.
More adaptive than fixed-pipeline agents because it re-evaluates the plan after each step, making it more resilient to unexpected command outputs or environmental changes.
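The re-plan-on-failure control flow described above can be sketched like this. Both `execute` and `replan` are hypothetical callbacks standing in for command execution and the LLM planner:

```python
def execute_with_replanning(plan, execute, replan, max_replans=3):
    """Run plan steps in order; on failure, ask the planner for a
    replacement for the remaining steps instead of aborting.

    plan: list of step descriptions.
    execute(step) -> (ok, output)          # hypothetical executor
    replan(plan, failed_step, output) -> new remaining steps  # hypothetical planner
    """
    trace = []   # reasoning trace: records steps and why the plan changed
    replans = 0
    i = 0
    while i < len(plan):
        ok, output = execute(plan[i])
        trace.append((plan[i], ok, output))
        if ok:
            i += 1
        elif replans < max_replans:
            replans += 1
            plan = plan[:i] + replan(plan, plan[i], output)
            trace.append(("replan", replans, plan[i:]))
        else:
            break  # budget exhausted; surface the partial trace
    return trace
```

Capping `max_replans` is one simple guard against the backtracking loops noted under Known Limitations.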
Structured action schema validation and execution
Medium confidence: Enforces a schema-based constraint system where the LLM can only execute actions (commands, API calls) that conform to predefined schemas. The framework validates action parameters before execution, preventing malformed or dangerous commands from reaching the terminal. Implements a registry pattern where actions are registered with type hints, constraints, and execution handlers.
Implements a two-stage validation pipeline: schema-level validation (parameter types, ranges) followed by semantic validation (path traversal checks, permission checks). Uses a registry pattern that allows runtime extension of available actions without modifying core agent logic.
Provides stronger safety guarantees than prompt-based instruction approaches because validation is enforced at the framework level, not dependent on LLM instruction-following.
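A minimal sketch of the two-stage validation pipeline plus action registry, assuming the interfaces described above (the `read_file` action, its stub handler, and the registry shape are all illustrative, not the listing's real API):

```python
from dataclasses import dataclass
from typing import Callable

REGISTRY: dict = {}

@dataclass
class Action:
    schema: dict        # param name -> required Python type
    validate: Callable  # semantic checks (paths, permissions)
    handler: Callable   # actually performs the action

def register(name, schema, validate, handler):
    """Registry pattern: new actions can be added at runtime
    without touching the dispatch logic below."""
    REGISTRY[name] = Action(schema, validate, handler)

def dispatch(name, params):
    action = REGISTRY[name]
    # Stage 1: schema-level validation (parameter types)
    for key, typ in action.schema.items():
        if not isinstance(params.get(key), typ):
            raise TypeError(f"{key} must be {typ.__name__}")
    # Stage 2: semantic validation (e.g. path traversal)
    action.validate(params)
    return action.handler(params)

def _no_traversal(params):
    if ".." in params["path"]:
        raise ValueError("path traversal rejected")

# Hypothetical action registration; the handler is a stub.
register("read_file", {"path": str}, _no_traversal,
         lambda p: f"reading {p['path']}")
```

Because `dispatch` rejects bad parameters before any handler runs, safety does not depend on the LLM obeying its prompt.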
Context-aware command history and state tracking
Medium confidence: Maintains a structured history of all executed commands, their outputs, and side effects. The agent can query this history to understand what has already been done, avoiding redundant operations. Implements state snapshots at key points, allowing the agent to reason about system state changes and detect when commands had unexpected effects.
Implements differential state tracking where only changes between snapshots are stored, reducing memory overhead. Provides a queryable history interface that allows the agent to ask 'have I already installed package X?' rather than re-running discovery commands.
More efficient than naive history approaches because it uses differential snapshots and allows the agent to query history semantically rather than scanning raw logs.
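The differential-snapshot idea can be sketched in a few lines. The key naming scheme (e.g. `"installed:curl"`) is an assumption made for illustration:

```python
class StateHistory:
    """Stores only the diffs between successive state snapshots and
    answers semantic queries without rescanning raw logs."""

    def __init__(self):
        self._current = {}   # latest known state, key -> value
        self._diffs = []     # (changed_entries, removed_keys) per snapshot

    def snapshot(self, state: dict):
        # Keep only what changed since the last snapshot.
        changed = {k: v for k, v in state.items() if self._current.get(k) != v}
        removed = [k for k in self._current if k not in state]
        self._diffs.append((changed, removed))
        self._current = dict(state)

    def query(self, key):
        # e.g. query("installed:curl") instead of re-running `which curl`
        return self._current.get(key)
```

Replaying `_diffs` in order reconstructs any intermediate state, which is how the agent can detect that a command had an unexpected side effect.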
Error recovery and retry logic with exponential backoff
Medium confidence: Automatically detects command failures (non-zero exit codes, timeouts, resource exhaustion) and implements retry strategies with exponential backoff. Different error types trigger different recovery strategies: transient errors retry immediately, resource errors wait before retrying, and permanent errors trigger re-planning. Includes timeout handling for long-running commands with configurable thresholds.
Implements error classification at the framework level, mapping exit codes and error messages to retry strategies. Uses exponential backoff with jitter to prevent thundering herd problems in distributed scenarios.
More sophisticated than simple retry loops because it classifies errors and applies appropriate strategies, reducing wasted API calls and improving overall task success rates.
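A sketch of error classification feeding into backoff with jitter. The exit-code-to-class mapping here is an assumption for illustration (75 in the style of `EX_TEMPFAIL`, 137 for a SIGKILL/OOM kill); a real classifier would also inspect stderr:

```python
import random
import time

TRANSIENT = {75}    # assumed EX_TEMPFAIL-style codes: retry immediately
RESOURCE = {137}    # assumed OOM-kill: wait longer before retrying

def classify(exit_code):
    if exit_code in TRANSIENT:
        return "transient"
    if exit_code in RESOURCE:
        return "resource"
    return "permanent"

def retry(run, max_attempts=5, base=0.5):
    """run() -> exit code. Retries transient/resource failures with
    exponential backoff plus jitter; gives up on permanent errors."""
    for attempt in range(max_attempts):
        code = run()
        if code == 0:
            return True
        kind = classify(code)
        if kind == "permanent":
            return False  # caller should trigger re-planning instead
        # Jittered exponential backoff avoids thundering-herd retries.
        delay = base * (2 ** attempt) * (1 + random.random())
        if kind == "resource":
            delay *= 2    # give the system extra time to free resources
        time.sleep(delay)
    return False
```

Returning `False` on permanent errors (rather than retrying) is what saves the wasted API calls the blurb mentions.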
LLM provider abstraction and multi-model support
Medium confidence: Abstracts the LLM backend behind a unified interface, allowing the agent to work with different providers (Gemini, OpenAI, Anthropic, local models) without code changes. Implements provider-specific adapters that handle differences in API formats, token counting, and function-calling schemas. Supports model switching at runtime based on task requirements or cost optimization.
Uses an adapter pattern where each provider has a concrete implementation handling API differences, token counting, and function-calling schema translation. Supports runtime model switching with automatic prompt/schema adaptation.
More flexible than provider-specific agents because it decouples agent logic from LLM implementation, enabling experimentation with different models without architectural changes.
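The adapter pattern described above, reduced to its skeleton. `EchoAdapter` is a toy stand-in; a real adapter would wrap the provider SDK (google-genai, openai, anthropic) behind the same two methods:

```python
from abc import ABC, abstractmethod

class ProviderAdapter(ABC):
    """Uniform interface the agent codes against, regardless of backend."""

    @abstractmethod
    def complete(self, messages: list) -> str: ...

    @abstractmethod
    def count_tokens(self, text: str) -> int: ...

class EchoAdapter(ProviderAdapter):
    # Toy adapter used only to demonstrate the seam; not a real provider.
    def complete(self, messages):
        return messages[-1]["content"].upper()

    def count_tokens(self, text):
        return max(1, len(text) // 4)  # crude ~4-chars-per-token heuristic

def make_agent(adapter: ProviderAdapter):
    """Agent logic depends only on the abstract interface, so the
    concrete model can be swapped at runtime."""
    def step(prompt):
        return adapter.complete([{"role": "user", "content": prompt}])
    return step
```

Swapping providers is then a one-line change at the `make_agent` call site, which is the decoupling the blurb claims.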
Benchmark-driven performance optimization
Medium confidence: Implements instrumentation and metrics collection throughout the agent execution pipeline to identify bottlenecks. Tracks latency per component (LLM inference, command execution, planning), token usage, and task success rates. Provides hooks for performance profiling and optimization, with built-in support for A/B testing different strategies.
Embeds performance instrumentation as a first-class concern in the agent architecture, not an afterthought. Provides structured metrics that enable direct comparison with other agents on standardized benchmarks like TerminalBench.
Enables data-driven optimization because metrics are collected systematically throughout execution, allowing precise identification of bottlenecks rather than guessing based on wall-clock time.
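Per-component instrumentation of this kind can be as small as a timer context manager plus counters; this sketch assumes nothing about the listing's actual metrics schema:

```python
import time
from collections import defaultdict
from contextlib import contextmanager

class Metrics:
    """Collects per-component latency and counters (tokens, successes)."""

    def __init__(self):
        self.latency = defaultdict(list)   # component -> list of seconds
        self.counters = defaultdict(int)   # e.g. tokens, task successes

    @contextmanager
    def timer(self, component):
        start = time.perf_counter()
        try:
            yield
        finally:
            self.latency[component].append(time.perf_counter() - start)

    def incr(self, name, amount=1):
        self.counters[name] += amount

    def summary(self):
        # Mean latency per component, for bottleneck comparison.
        return {c: sum(v) / len(v) for c, v in self.latency.items()}
```

Usage: wrap each pipeline stage in `with metrics.timer("llm_inference"):` and call `metrics.incr("tokens", n)` after each model call, then compare `summary()` across runs.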
Capabilities are decomposed by AI analysis. Each maps to specific user intents and improves with match feedback.
Related Artifacts (sharing capabilities)
Artifacts that share capabilities with OSS Agent I built topped the TerminalBench on Gemini-3-flash-preview, ranked by overlap. Discovered automatically through the match graph.
BabyCommandAGI
Test what happens when you combine CLI and LLM
Mini AGI
General-purpose agent based on GPT-3.5 / GPT-4
Voyager
LLM-powered lifelong learning agent in Minecraft
Cline
Autonomous AI coding assistant for VS Code — reads, edits, runs commands with human-in-the-loop approval.
Best For
- ✓ developers building autonomous CLI automation agents
- ✓ teams implementing DevOps task automation with LLM reasoning
- ✓ researchers benchmarking agent performance on terminal-based tasks
- ✓ autonomous DevOps agents handling multi-stage deployments
- ✓ research projects evaluating agent planning capabilities
- ✓ developers building task-oriented CLI assistants
- ✓ production deployments requiring safety guardrails on agent actions
- ✓ teams extending agents with custom domain-specific actions
Known Limitations
- ⚠ Sandboxing scope depends on host OS permissions — cannot fully isolate destructive commands without containerization
- ⚠ Real-time streaming of large command outputs may cause context window overflow in the LLM
- ⚠ Interactive terminal prompts (password inputs, confirmations) require pre-configured responses or timeout handling
- ⚠ Plan quality degrades with task complexity — very deep dependency chains may exceed LLM reasoning capacity
- ⚠ Backtracking can create loops if failure modes are not properly distinguished
- ⚠ No built-in cost optimization — may generate redundant planning steps that increase API calls
About
Show HN: OSS Agent I built topped the TerminalBench on Gemini-3-flash-preview