ARC-AGI
Benchmark · Free
Abstract reasoning benchmark with $1M prize for AGI.
Capabilities (12 decomposed)
interactive-visual-puzzle-task-generation
Medium confidence
Generates and renders abstract visual puzzle tasks as interactive game environments where agents must explore state spaces, plan actions, and achieve goals through a Percept → Plan → Action cycle. Tasks are presented in configurable rendering modes (terminal text-based or programmatic API access) and support memory persistence across action sequences, enabling agents to learn patterns from minimal examples.
Implements tasks as interactive game environments with agent-based exploration rather than static puzzle-solving; agents must discover patterns through action-observation cycles with memory and goal acquisition, mirroring human learning efficiency on novel tasks. Rendering modes support both human-interpretable terminal output and programmatic API access for scalable evaluation (2K+ FPS with rendering disabled).
Differs from static benchmark suites (MMLU, ARC-Easy) by requiring agents to actively explore and plan within unfamiliar environments, measuring learning efficiency and abstract reasoning rather than knowledge retrieval or pattern matching on familiar domains.
local-python-sdk-task-execution
Medium confidence
Provides a Python SDK (arc-agi package) for local execution of benchmark tasks with configurable rendering modes and performance optimization. The SDK exposes a GameAction class for discrete action specification, an Arcade environment factory for task instantiation, and a scorecard evaluation system. Execution runs entirely client-side without mandatory cloud dependencies, achieving 2000+ FPS when rendering is disabled.
Implements dual-mode execution: high-performance local evaluation (2K+ FPS) without rendering for batch evaluation, and optional terminal rendering for human inspection. Avoids cloud dependency and API rate limits by running tasks entirely client-side, enabling tight integration with custom training loops and offline evaluation.
Faster than cloud-only benchmarks (e.g., OpenAI Evals) by eliminating network round-trips; more flexible than static test suites by supporting programmatic task instantiation and custom action spaces through the GameAction abstraction.
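The listing names the arc-agi package, the Arcade factory, GameAction, and 2000+ FPS headless execution. A minimal end-to-end sketch under those assumptions follows; the module name `arc_agi`, the `render` keyword, and the `reset()` call are inferred from this listing, not verified against the SDK documentation.

```python
# Minimal local-evaluation sketch. The import path, the `render` keyword,
# and reset() are assumptions inferred from this listing.
from arc_agi import Arcade, GameAction

env = Arcade.make("ls20", render=False)  # headless mode for throughput
obs, done = env.reset(), False           # reset() is assumed

while not done:
    action = GameAction.ACTION1          # trivial fixed policy for illustration
    obs, done = env.step(action)         # step() per the listing: observation + done flag
```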
environment-step-based-interaction-loop
Medium confidence
Implements the core agent-environment interaction loop through env.step(action), which executes an action, updates task state, and returns observations. The step function encapsulates the Percept → Plan → Action cycle, enabling agents to iteratively explore tasks and learn patterns. Step returns observation, done flag, and implicit feedback enabling agents to assess action effectiveness.
Implements the core Percept → Plan → Action cycle through a step function that encapsulates state updates and observation generation. Implicit feedback enables agents to assess action effectiveness without explicit reward signals.
More flexible than explicit-reward benchmarks by enabling agents to infer success from observations; more realistic than single-step reasoning by supporting iterative exploration and learning.
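A sketch of the Percept → Plan → Action cycle built on env.step(); the planner here is a placeholder random policy, and reset() plus the GameAction import path are assumptions based on this listing.

```python
import random

from arc_agi import GameAction  # import path assumed from the listing

def run_episode(env, max_steps=100):
    """Percept -> Plan -> Action: observe, choose an action, apply it, repeat."""
    obs = env.reset()  # assumed reset() API
    for _ in range(max_steps):
        # Percept: the latest observation is the only feedback available.
        # Plan: placeholder; a real agent would infer the goal from history here.
        action = random.choice(list(GameAction))
        # Action: apply it and read the implicit feedback from the new state.
        obs, done = env.step(action)
        if done:
            break
    return obs
```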
open-source-benchmark-ecosystem
Medium confidence
Provides open-source access to benchmark tasks, evaluation infrastructure, and reference implementations, enabling community-driven research and algorithm development. The benchmark is published on GitHub with MIT license (implied by open-source claim), supporting reproducibility, contribution, and derivative work. Foundation explicitly emphasizes 'open-source ecosystem' and rewards open-source contributions through ARC Prize 2026.
Provides fully open-source benchmark with explicit community-driven research model and financial incentives (ARC Prize 2026) for open-source contributions. Foundation emphasizes ecosystem development and rewards novel algorithmic progress through prize pool.
More transparent than proprietary benchmarks by open-sourcing all code and tasks; more incentivized than academic benchmarks by offering prize money for contributions and progress.
rest-api-based-remote-task-access
Medium confidence
Exposes benchmark tasks and evaluation through a REST API (documented at https://docs.arcprize.org) with API key authentication, enabling remote task access without local installation. The API abstracts task execution and scoring, allowing integration into web-based systems, cloud pipelines, and multi-language environments. Authentication uses API keys (with anonymous access available but limited).
Decouples task execution from local environment by exposing a REST API layer, enabling language-agnostic access and cloud-native integration. Supports both authenticated (API key) and anonymous access modes, with performance optimization through optional local caching or remote execution.
More flexible than SDK-only benchmarks by supporting remote access and multi-language clients; more standardized than custom evaluation scripts by providing a centralized API endpoint with consistent versioning and authentication.
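A hedged sketch of remote access: only the documentation URL and API-key authentication come from this listing; the base URL, endpoint path, and response shape below are placeholders for illustration.

```python
import os

import requests

API_KEY = os.environ.get("ARC_API_KEY", "")   # anonymous access is limited
BASE_URL = "https://api.arcprize.org"         # hypothetical base URL; see docs.arcprize.org

def list_tasks():
    """Fetch available task IDs; the /tasks endpoint is illustrative only."""
    headers = {"Authorization": f"Bearer {API_KEY}"} if API_KEY else {}
    resp = requests.get(f"{BASE_URL}/tasks", headers=headers, timeout=30)
    resp.raise_for_status()
    return resp.json()
```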
abstract-pattern-recognition-evaluation
Medium confidence
Measures an AI system's ability to recognize and generalize abstract patterns from minimal examples (1-5 training demonstrations) without domain-specific knowledge or pre-training on similar tasks. Evaluation is based on whether agents can infer transformation rules, spatial relationships, and logical operations from limited visual evidence and apply them to novel test cases. This capability directly measures fluid intelligence and learning efficiency rather than memorized knowledge.
Explicitly designed to measure learning efficiency and abstract reasoning on novel tasks, resisting scaling-only solutions. Foundation claims 'scaling alone will not reach AGI' and positions ARC-AGI as identifying capability gaps that require new algorithmic ideas, not just parameter scaling.
Differs from knowledge benchmarks (MMLU, TriviaQA) by requiring genuine learning and generalization rather than retrieval; differs from domain-specific reasoning benchmarks (math, code) by using abstract visual puzzles without domain conventions or pre-training advantages.
agent-memory-and-goal-acquisition
Medium confidence
Supports agent memory persistence and goal acquisition across action sequences, enabling agents to maintain state, learn from observations, and dynamically discover task objectives. The Percept → Plan → Action cycle allows agents to accumulate knowledge across multiple steps, with memory mechanisms enabling pattern recognition and strategy refinement. Goals are not explicitly provided; agents must infer them from task structure and feedback.
Implements implicit goal acquisition where agents must discover task objectives through exploration and observation rather than explicit specification. Memory mechanisms enable agents to accumulate knowledge across action sequences, supporting iterative refinement and pattern learning.
More challenging than explicit-goal benchmarks (e.g., Atari) by requiring agents to infer objectives; more realistic than single-step reasoning tasks by supporting multi-step planning and memory-based learning.
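A sketch of the memory mechanism described above: an agent that records observation-action pairs across steps so later decisions can draw on earlier evidence. Only env.step() and GameAction come from the listing; the agent class and reset() are illustrative.

```python
import random

from arc_agi import GameAction  # import path assumed from the listing

class MemoryAgent:
    """Accumulates (observation, action) pairs across an action sequence."""

    def __init__(self):
        self.history = []

    def act(self, obs):
        # A real agent would mine self.history to infer the unstated goal;
        # here we only record the trajectory and pick an arbitrary action.
        action = random.choice(list(GameAction))
        self.history.append((obs, action))
        return action

def run(env, agent, max_steps=200):
    obs, done = env.reset(), False  # reset() is assumed
    while not done and len(agent.history) < max_steps:
        obs, done = env.step(agent.act(obs))
    return agent.history
```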
configurable-rendering-and-visualization
Medium confidence
Provides dual rendering modes for task visualization: terminal-based text rendering for human inspection and programmatic access (no rendering) for high-performance evaluation. Terminal mode enables visual debugging and human understanding of task state, while the no-render mode optimizes for throughput (2000+ FPS) by eliminating rendering overhead. Rendering mode is configurable per task instantiation.
Implements dual-mode rendering with explicit performance optimization: terminal mode for interpretability and programmatic mode for throughput (2K+ FPS). Rendering is configurable at instantiation, enabling developers to balance debugging capability and evaluation speed.
More flexible than single-mode benchmarks by supporting both human inspection and high-performance evaluation; faster than graphical rendering systems by offering text-based and no-render alternatives.
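A sketch contrasting the two instantiation modes; the listing states only that rendering is configurable per instantiation, so the `render` keyword name is an assumption.

```python
from arc_agi import Arcade  # import path assumed from the listing

# Human inspection: terminal rendering of task state for debugging.
debug_env = Arcade.make("ft09", render=True)   # keyword name assumed

# Batch evaluation: rendering disabled, trading visuals for 2K+ FPS throughput.
fast_env = Arcade.make("ft09", render=False)
```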
scorecard-based-evaluation-aggregation
Medium confidence
Aggregates task performance into a structured scorecard that summarizes agent evaluation results across the benchmark. The scorecard is generated via arc.get_scorecard() and provides aggregated metrics, though the exact structure and metrics are not formally documented. Scorecard enables comparison across agents and tracking of performance progress.
Provides a standardized scorecard abstraction for aggregating task performance, enabling consistent comparison across agents and competition submissions. Scorecard generation is decoupled from task execution, allowing post-hoc analysis and custom metric computation.
More standardized than custom evaluation scripts by providing a centralized scorecard API; more flexible than fixed-metric benchmarks by supporting custom analysis of underlying task results.
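A sketch of post-hoc scorecard retrieval; arc.get_scorecard() is named in this listing, while the module alias and the scorecard's fields are not documented, so the example only inspects the returned object.

```python
import arc_agi as arc  # alias assumed; the listing refers to arc.get_scorecard()

# Retrieve aggregated results after one or more evaluation runs.
scorecard = arc.get_scorecard()

# The scorecard schema is not formally documented; inspect it directly.
print(scorecard)
```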
arc-prize-2026-competition-integration
Medium confidence
Integrates with the ARC Prize 2026 competition infrastructure, enabling researchers to submit solutions, receive evaluation on held-out test sets, and compete for $2M in prizes. Competition is hosted on Kaggle and provides standardized submission mechanisms, leaderboard tracking, and prize distribution. The foundation rewards open-source contributions and novel algorithmic progress.
Integrates benchmark with active competition infrastructure ($2M prize pool, Kaggle hosting) and explicitly rewards open-source contributions, creating financial incentives for novel algorithmic progress. Provides access to held-out test sets for official evaluation beyond public benchmark.
More incentivized than academic benchmarks by offering prize money; more transparent than proprietary competitions by emphasizing open-source contributions and community-driven research.
task-id-based-environment-instantiation
Medium confidence
Enables task instantiation by task ID (e.g., 'ls20', 'ft09') through the Arcade.make() factory method, abstracting task loading and initialization. Task IDs map to specific puzzle instances in the benchmark, allowing reproducible task selection and batch evaluation. The factory pattern supports configurable rendering modes and other task parameters.
Implements task instantiation via factory pattern with task ID abstraction, enabling reproducible task selection and batch evaluation without exposing task loading details. Task IDs provide stable references across benchmark versions.
More reproducible than random task selection by enabling explicit task ID specification; more flexible than fixed task lists by supporting dynamic task loading via factory method.
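A sketch of reproducible batch evaluation over explicit task IDs using the factory pattern described above; the IDs 'ls20' and 'ft09' appear in this listing, while the policy, step cap, and reset()/render details are illustrative.

```python
from arc_agi import Arcade, GameAction  # import path assumed from the listing

TASK_IDS = ["ls20", "ft09"]  # explicit IDs make task selection reproducible

results = {}
for task_id in TASK_IDS:
    env = Arcade.make(task_id, render=False)  # render keyword assumed
    obs, done = env.reset(), False            # reset() is assumed
    steps = 0
    while not done and steps < 500:           # arbitrary step cap
        obs, done = env.step(GameAction.ACTION1)
        steps += 1
    results[task_id] = {"steps": steps, "done": done}
```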
gameaction-discrete-action-space
Medium confidence
Defines a discrete action space through the GameAction enum, enabling agents to interact with tasks through a fixed set of predefined actions. Actions are specified as enum values (e.g., GameAction.ACTION1) and passed to env.step(), abstracting the underlying action semantics. The action space is task-agnostic, supporting a consistent interface across all benchmark tasks.
Abstracts task interaction through a discrete GameAction enum, providing a consistent interface across all benchmark tasks. Action semantics are abstracted, enabling agents to learn action effects through observation rather than explicit specification.
More standardized than task-specific action interfaces by providing a unified enum; more flexible than fixed action sets by supporting task-agnostic action selection.
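A sketch of enumerating the discrete action space; the listing states that GameAction is an enum with members such as ACTION1, but the full member list is not documented, so the loop below simply inspects whatever the enum exposes.

```python
from arc_agi import GameAction  # import path assumed from the listing

# A task-agnostic action set means any policy (random, epsilon-greedy, learned)
# can be written once against the same enum and reused across all tasks.
for action in GameAction:
    print(action.name, action.value)
```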
Capabilities are decomposed by AI analysis. Each maps to specific user intents and improves with match feedback.
Related Artifacts (sharing capabilities)
Artifacts that share capabilities with ARC-AGI, ranked by overlap. Discovered automatically through the match graph.
TaskWeaver
Microsoft's code-first agent framework for seamlessly planning and executing data analytics tasks.
Open Interpreter
OpenAI's Code Interpreter in your terminal, running locally.
Cline (Claude Dev)
Autonomous AI coding agent with file and terminal control.
ChatGPT
ChatGPT by OpenAI is a large language model that interacts in a conversational way.
QuantHUB
Elevate data skills with AI-driven, tailored learning...
Best For
- ✓ AI researchers measuring general reasoning capabilities
- ✓ Teams developing reasoning-focused LLM agents
- ✓ Benchmark participants competing in the ARC Prize 2026
- ✓ Researchers with local compute resources and reproducibility requirements
- ✓ Teams building custom agents requiring tight integration with the evaluation loop
- ✓ Developers optimizing for evaluation throughput in iterative development
- ✓ Agents using iterative planning or reinforcement learning
- ✓ Teams developing multi-step reasoning approaches
Known Limitations
- ⚠ Visual-only format excludes language-based reasoning; no text input/output in puzzle solving
- ⚠ Task specifics (grid dimensions, color palettes, transformation rules) not fully documented in public materials
- ⚠ No dynamic task generation or rotation mentioned; contamination risk is high in an active competition environment
- ⚠ Evaluation protocol and statistical rigor not formally specified; no confidence intervals or significance testing documented
- ⚠ Single-agent focus; does not measure multi-agent coordination or collaborative reasoning
- ⚠ Local execution requires sufficient disk space and memory for all task environments
Requirements
Input / Output
UnfragileRank
UnfragileRank is computed from adoption signals, documentation quality, ecosystem connectivity, match graph feedback, and freshness. No artifact can pay for a higher rank.
About
Abstraction and Reasoning Corpus benchmark designed to measure general intelligence in AI systems through novel visual puzzles requiring abstract pattern recognition, with a $1M prize for solutions matching human performance.
Categories
Alternatives to ARC-AGI
Data Sources