What can Large Language Models as Optimizers (OPRO) do?

llm-based gradient-free optimization via in-context learning, trajectory-conditioned solution generation with scoring feedback, prompt optimization via iterative refinement and scoring, hyperparameter optimization via llm-guided search, reward function discovery via code generation (eureka extension), multi-step reasoning trajectory generation for complex optimization

Large Language Models as Optimizers (OPRO)

Product

* ⏫ 10/2023: [Eureka: Human-Level Reward Design via Coding Large Language Models (Eureka)](https://arxiv.org/abs/2310.12931)


6 capabilities

Capabilities (6 decomposed)

llm-based gradient-free optimization via in-context learning

Medium confidence

Uses large language models as black-box optimizers by prompting them with optimization trajectories (previous solutions and their scores) to generate improved candidate solutions iteratively. The LLM learns optimization patterns from in-context examples without explicit gradient computation, treating the optimization problem as a sequence prediction task where better solutions are generated by conditioning on historical performance data.
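As an illustration of how a trajectory of previous solutions and scores might be turned into a prompt, here is a minimal sketch. The function name `build_meta_prompt`, the wording of the instructions, the `top_k` cutoff, and the choice to list the best solutions last are all assumptions of this sketch, not details confirmed by this page.

```python
from typing import List, Tuple


def build_meta_prompt(task: str,
                      trajectory: List[Tuple[str, float]],
                      top_k: int = 20) -> str:
    """Render (solution, score) pairs as in-context examples.

    Assumes top_k >= 1. Only the top_k best-scoring pairs are kept,
    and they are listed from worst to best.
    """
    # Sort ascending by score, then keep the last top_k entries (the best ones).
    best = sorted(trajectory, key=lambda pair: pair[1])[-top_k:]
    lines = [task, "", "Previous solutions and their scores (higher is better):"]
    for solution, score in best:
        lines.append(f"solution: {solution}\nscore: {score:.2f}")
    lines.append("")
    lines.append("Write a new solution that differs from the ones above "
                 "and achieves a higher score.")
    return "\n".join(lines)
```

The resulting string would be sent to whichever LLM is acting as the optimizer; the reply is then evaluated and appended to the trajectory.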

Solves for

Optimize hyperparameters, prompts, or configurations without access to gradients

Find better solutions to discrete or non-differentiable problems using only evaluation feedback

Leverage LLM reasoning to guide search through high-dimensional solution spaces

Reduce optimization iterations by using LLM's learned priors about what makes good solutions

Best for

Researchers optimizing prompt templates or hyperparameters for LLM tasks

Teams solving discrete optimization problems where gradient-based methods are infeasible

Practitioners needing few-shot optimization without training custom models

Requires

Access to a capable LLM (GPT-3.5+ or equivalent) via API or local deployment

An objective function that can score candidate solutions (it only needs to be evaluable, not differentiable)

Ability to serialize solutions as text for LLM input

Limitations

Optimization quality depends heavily on LLM's ability to recognize patterns in the trajectory history — may plateau on complex multimodal landscapes

Each optimization step requires at least one full LLM generation call (many forward passes, one per generated token), making it computationally expensive compared to gradient-based methods for large-scale problems

No theoretical convergence guarantees; performance is empirical and problem-dependent

What makes it unique

Treats optimization as an in-context learning problem where the LLM infers optimization dynamics from trajectory history rather than using explicit gradient signals or learned surrogate models. The key architectural insight is that LLMs can act as meta-optimizers by recognizing patterns in (solution, score) pairs and generating better candidates without domain-specific training.

vs alternatives

Offers an alternative to traditional Bayesian optimization and evolutionary algorithms on discrete or non-differentiable problems by leveraging the LLM's semantic understanding of solution space structure, with no gradient computation or surrogate model training. Whether it actually outperforms those methods is empirical and problem-dependent.

trajectory-conditioned solution generation with scoring feedback

Medium confidence

Implements an iterative loop where the LLM receives a formatted history of (solution, evaluation_score) pairs and generates a new candidate solution. The prompt structure encodes the optimization trajectory as in-context examples, allowing the LLM to learn implicit patterns about which solution characteristics correlate with higher scores. After evaluation, the new solution and its score are appended to the trajectory for the next iteration.
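The loop described above can be sketched as follows. `propose` stands in for the LLM call and `evaluate` for the scoring function; both are passed in as plain callables. The toy proposer and toy objective are hypothetical placeholders so the sketch runs without any API access.

```python
from typing import Callable, List, Tuple

Trajectory = List[Tuple[str, float]]


def optimize(propose: Callable[[Trajectory], str],
             evaluate: Callable[[str], float],
             initial: List[str],
             steps: int) -> Tuple[str, float]:
    """Generate, score, and append one candidate per step; return the best pair."""
    trajectory: Trajectory = [(s, evaluate(s)) for s in initial]
    for _ in range(steps):
        candidate = propose(trajectory)                      # LLM call in a real system
        trajectory.append((candidate, evaluate(candidate)))  # feedback for the next step
    return max(trajectory, key=lambda pair: pair[1])


def toy_evaluate(solution: str) -> float:
    # Hypothetical objective: integers closer to 7 score higher.
    return float(-abs(int(solution) - 7))


def toy_propose(trajectory: Trajectory) -> str:
    # Hypothetical stand-in for the LLM: try an untried neighbour of the best solution.
    best, _ = max(trajectory, key=lambda pair: pair[1])
    tried = {solution for solution, _ in trajectory}
    for delta in (1, -1, 2, -2):
        candidate = str(int(best) + delta)
        if candidate not in tried:
            return candidate
    return best
```

Calling `optimize(toy_propose, toy_evaluate, ["0"], 10)` walks from 0 up to 7. In practice `propose` would format the trajectory into a prompt, call the model, and parse a candidate out of the reply.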

Solves for

Iteratively refine solutions by showing the LLM what worked and what didn't

Build optimization trajectories that demonstrate solution quality trends

Enable the LLM to discover domain-specific heuristics from evaluation feedback

Implement few-shot meta-learning for optimization without retraining

Best for

Prompt engineers optimizing instruction templates for downstream tasks

Hyperparameter tuning for machine learning models

Discrete optimization problems (e.g., combinatorial search, code generation)

Requires

LLM with sufficient context window (4K+ tokens recommended for meaningful trajectory history)

Evaluation function that returns scalar scores (or easily interpretable metrics)

Ability to format solutions and scores as natural language or structured text

Limitations

Trajectory length is bounded by LLM context window; long optimization histories may be truncated or summarized, losing fine-grained signal

LLM may overfit to spurious correlations in short trajectories, generating solutions that exploit evaluation noise rather than improving fundamentally

No mechanism to enforce diversity in generated solutions; may converge to local optima or repetitive candidates
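One way to work within the context-window and repetition limits above is to deduplicate the trajectory and keep only the best entries that fit a size budget. The sketch below uses character count as a rough proxy for tokens, which is an assumption; a real implementation would count tokens with the model's own tokenizer. The name `trim_trajectory` is hypothetical.

```python
from typing import Dict, List, Tuple


def trim_trajectory(trajectory: List[Tuple[str, float]],
                    max_chars: int) -> List[Tuple[str, float]]:
    """Keep the best unique solutions whose combined length fits in max_chars."""
    # Collapse repeated solutions, remembering the highest score seen for each.
    best_score: Dict[str, float] = {}
    for solution, score in trajectory:
        if solution not in best_score or score > best_score[solution]:
            best_score[solution] = score

    # Fill the budget starting from the highest-scoring solution.
    ranked = sorted(best_score.items(), key=lambda pair: pair[1], reverse=True)
    kept: List[Tuple[str, float]] = []
    used = 0
    for solution, score in ranked:
        if used + len(solution) > max_chars:
            break
        kept.append((solution, score))
        used += len(solution)

    # Return worst-to-best so the strongest examples sit closest to the request.
    return kept[::-1]
```

Deduplication does not by itself create diversity, but it stops repeated candidates from crowding informative ones out of the prompt.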

What makes it unique

Encodes the optimization history (or as much of it as fits in the context window) as in-context examples rather than using a learned surrogate model or explicit reward function. The LLM implicitly learns to recognize patterns in the trajectory (e.g., 'solutions with property X scored higher') and applies those patterns to generate the next candidate, enabling adaptation without explicit model updates.

vs alternatives

Simpler and faster to implement than Bayesian optimization or neural surrogate models, while capturing richer semantic patterns than random search or grid search by leveraging the LLM's pre-trained understanding of solution quality.

prompt optimization via iterative refinement and scoring

Medium confidence

Applies the OPRO framework specifically to optimize natural language prompts by treating prompt text as the solution space and downstream task performance (e.g., accuracy on a benchmark) as the evaluation metric. The LLM generates improved prompt variations by analyzing which previous prompts achieved higher scores, learning to modify instruction phrasing, examples, and constraints to maximize task performance. This enables automated prompt engineering without manual trial-and-error.
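To make the evaluation side of prompt optimization concrete, here is a minimal sketch of scoring one candidate instruction by accuracy on a small labeled set. The prompt layout (`Q:` / `A:`), the function name `score_instruction`, and the toy scorer model are assumptions for illustration only.

```python
from typing import Callable, List, Tuple


def score_instruction(instruction: str,
                      dataset: List[Tuple[str, str]],
                      answer: Callable[[str], str]) -> float:
    """Accuracy of `answer` when the instruction is prepended to each question."""
    if not dataset:
        raise ValueError("dataset must not be empty")
    correct = 0
    for question, expected in dataset:
        prediction = answer(f"{instruction}\n\nQ: {question}\nA:")
        correct += prediction.strip() == expected  # exact-match scoring
    return correct / len(dataset)


def toy_answer(prompt: str) -> str:
    # Hypothetical scorer model: adds correctly only when told to work step by step.
    question = prompt.split("Q: ")[1].split("\nA:")[0]
    left, right = (int(part) for part in question.split(" + "))
    return str(left + right) if "step by step" in prompt else "0"
```

The returned accuracy is the score that would be paired with the instruction in the optimization trajectory. A real `answer` callable would wrap a call to the downstream model.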

Solves for

Automatically improve prompt templates for classification, summarization, or reasoning tasks

Discover effective instruction phrasings that outperform hand-crafted prompts

Adapt prompts to new domains or tasks by learning from evaluation feedback

Scale prompt optimization across multiple tasks without manual intervention

Best for

ML teams optimizing prompts for production LLM applications

Researchers studying prompt design and instruction engineering

Practitioners building few-shot learning systems with limited labeled data

Requires

Access to an LLM (GPT-3.5+ or equivalent) for prompt generation

A downstream task with a quantifiable evaluation metric (accuracy, F1, BLEU, etc.)

Evaluation dataset or benchmark to score prompt candidates

Limitations

Optimization is task-specific; prompts optimized for one task may not transfer to other domains or to other LLMs

Evaluation requires running the downstream task multiple times, incurring significant computational and API costs

LLM-generated prompts may be verbose, redundant, or contain unnecessary complexity compared to human-written prompts
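Because each evaluation is costly, one simple mitigation is to cache scores so that a prompt the optimizer proposes twice is only evaluated once. The sketch below uses `functools.lru_cache` from the standard library; `expensive_score` is a hypothetical placeholder for a real benchmark run, and caching like this is only appropriate when evaluation is deterministic.

```python
from functools import lru_cache

# Counts how often the underlying evaluation actually runs.
CALLS = {"count": 0}


def expensive_score(prompt: str) -> float:
    # Placeholder metric standing in for a real benchmark run.
    CALLS["count"] += 1
    return len(set(prompt.split())) / 10.0


@lru_cache(maxsize=None)
def cached_score(prompt: str) -> float:
    """Return the score for a prompt, evaluating each distinct prompt only once."""
    return expensive_score(prompt)
```

With noisy evaluation, caching would freeze a single noisy sample per prompt, so averaging several runs before caching may be the safer choice.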

What makes it unique

Treats prompts as first-class optimization variables, using the LLM itself to generate improved prompts by analyzing which previous prompts achieved higher downstream task performance. This creates a self-improving loop where the LLM learns to write better instructions for itself or other models, without requiring gradient computation or labeled training data.

vs alternatives

Faster and cheaper than manual prompt engineering or grid search, while more interpretable and controllable than black-box hyperparameter optimization, because the LLM generates human-readable prompts that practitioners can understand and further refine.

hyperparameter optimization via llm-guided search

Medium confidence

Applies OPRO to optimize hyperparameters (learning rates, batch sizes, regularization coefficients, etc.) by representing hyperparameter configurations as text and iteratively generating improved configurations based on their validation performance. The LLM learns implicit relationships between hyperparameter values and model performance from the trajectory history, generating candidates that balance exploration (trying new values) and exploitation (refining promising regions).
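Representing configurations as text could look like the sketch below, which renders past trials as JSON lines and parses a new configuration out of a free-form model reply. The JSON layout, the `val_acc` label, and both function names are assumptions of this sketch.

```python
import json
from typing import Dict, List, Tuple

Config = Dict[str, float]


def render_trials(trials: List[Tuple[Config, float]]) -> str:
    """One line per trial, worst validation score first."""
    rows = [
        f"{json.dumps(config, sort_keys=True)} -> val_acc={score:.3f}"
        for config, score in sorted(trials, key=lambda trial: trial[1])
    ]
    return "\n".join(rows)


def parse_config(reply: str) -> Config:
    """Extract the outermost {...} span from a reply and read it as a flat JSON object."""
    start = reply.index("{")        # raises ValueError if no object is present
    end = reply.rindex("}") + 1
    raw = json.loads(reply[start:end])
    return {name: float(value) for name, value in raw.items()}
```

`parse_config` assumes a flat object of numeric values; nested or categorical hyperparameters would need a richer schema.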

Solves for

Automatically tune hyperparameters for machine learning models without manual grid/random search

Discover hyperparameter configurations that outperform defaults or hand-tuned values

Adapt hyperparameters to new datasets or model architectures by learning from evaluation feedback

Reduce hyperparameter tuning time and computational cost compared to exhaustive search

Best for

ML engineers tuning models for production deployment

Researchers exploring hyperparameter sensitivity across datasets

Teams with limited compute budgets seeking efficient tuning

Requires

Access to an LLM for configuration generation

A trainable model with a validation metric (accuracy, loss, F1, etc.)

Computational resources to train multiple model instances

Limitations

Optimization quality depends on LLM's ability to infer hyperparameter-performance relationships from limited trajectory data; may miss non-obvious interactions

Each iteration requires training a full model and evaluating on validation data, making this computationally expensive for large models or datasets

LLM may generate out-of-range or invalid hyperparameter values (e.g., negative learning rates) requiring post-hoc filtering or constraint enforcement
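The post-hoc filtering mentioned in the last limitation can be as simple as a bounds check before any training run is launched. The search space in `BOUNDS` is a hypothetical example, not a recommended range.

```python
from typing import Dict, Tuple

# Hypothetical search space: inclusive (low, high) bounds per hyperparameter.
BOUNDS: Dict[str, Tuple[float, float]] = {
    "lr": (1e-5, 1.0),
    "dropout": (0.0, 0.9),
}


def validate(config: Dict[str, float],
             bounds: Dict[str, Tuple[float, float]] = BOUNDS) -> Tuple[bool, str]:
    """Return (is_valid, reason) for an LLM-proposed configuration."""
    missing = set(bounds) - set(config)
    if missing:
        return False, f"missing keys: {sorted(missing)}"
    for name, (low, high) in bounds.items():
        value = config[name]
        # The chained comparison is False for NaN, so NaN is rejected as well.
        if not (low <= value <= high):
            return False, f"{name}={value} outside [{low}, {high}]"
    return True, "ok"
```

The rejection reason can be fed back into the next prompt so the model sees why a proposal was discarded.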

What makes it unique

Uses the LLM's semantic understanding of numerical relationships to generate hyperparameter configurations that are more likely to improve performance, rather than random sampling or grid search. The LLM learns implicit patterns like 'smaller learning rates help with larger models' or 'higher dropout rates reduce overfitting' from the trajectory, enabling more intelligent exploration.

vs alternatives

More interpretable than Bayesian optimization (generates human-readable configurations) and potentially needing fewer trials than random/grid search, while requiring no surrogate model training or gradient computation. However, typically slower than dedicated hyperparameter-optimization methods, such as the samplers in Optuna or the Hyperband early-stopping algorithm.

reward function discovery via code generation (eureka extension)

Medium confidence

Extends OPRO to automatically design reward functions for reinforcement learning by prompting an LLM to generate Python code that computes rewards based on environment observations. The LLM iteratively refines reward functions by analyzing which previous reward functions led to better task performance (e.g., higher episode returns), learning to write code that captures task-relevant objectives without manual reward engineering. This enables automated reward design for complex control tasks.
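A minimal sketch of turning LLM-generated reward code into a callable is shown below. The function name `compute_reward`, the dictionary-style observation, and the candidate source are hypothetical; real systems typically operate on simulator tensors. Note that `exec` is not a sandbox, so untrusted generated code should run in an isolated process or container.

```python
from typing import Callable, Dict

# Hypothetical LLM output: reward being near the target, penalize speed.
CANDIDATE = '''
def compute_reward(obs):
    return -abs(obs["distance"]) - 0.1 * abs(obs["velocity"])
'''


def load_reward(source: str,
                name: str = "compute_reward") -> Callable[[Dict[str, float]], float]:
    """Compile generated source and return the reward function it defines."""
    code = compile(source, "<llm_reward>", "exec")  # raises SyntaxError on bad code
    namespace: Dict[str, object] = {}
    exec(code, namespace)                           # WARNING: not a security boundary
    fn = namespace.get(name)
    if not callable(fn):
        raise ValueError(f"generated code does not define {name}()")
    return fn
```

Candidates that fail to compile or do not define the expected function can be discarded before any RL training time is spent on them.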

Solves for

Automatically design reward functions for RL agents without manual engineering

Discover reward functions that lead to better task performance than hand-crafted rewards

Adapt reward functions to new tasks or environments by learning from RL training results

Enable non-experts to train RL agents by automating the reward design bottleneck

Best for

Robotics researchers training agents for manipulation or locomotion tasks

RL practitioners seeking to avoid manual reward engineering

Teams building general-purpose RL systems that adapt to new tasks

Requires

Access to an LLM capable of generating syntactically correct Python code (GPT-3.5+ or equivalent)

An RL environment (typically a simulator) with observable state and action spaces; it does not need to be differentiable

RL training infrastructure (e.g., PyTorch, TensorFlow, JAX) and computational resources (GPUs/TPUs)

Limitations

Reward functions generated by the LLM may be brittle, exploiting unintended environment dynamics (reward hacking) rather than learning robust behaviors

Each iteration requires training a full RL agent to convergence, incurring massive computational cost (hours to days per iteration)

LLM-generated code may contain bugs, inefficiencies, or numerical instabilities that degrade RL training
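A cheap screen for the bugs and numerical instabilities noted above is to run each generated reward function on a handful of sample observations before training. The helper name `is_numerically_safe` and the dictionary-style observations are assumptions of this sketch; it catches crashes and non-finite values, not reward hacking.

```python
import math
from typing import Callable, Dict, Iterable


def is_numerically_safe(reward_fn: Callable[[Dict[str, float]], float],
                        samples: Iterable[Dict[str, float]]) -> bool:
    """Return True only if the reward is a finite number on every sample."""
    for obs in samples:
        try:
            value = float(reward_fn(obs))
        except Exception:
            # Any crash (missing key, division by zero, wrong type) fails the screen.
            return False
        if not math.isfinite(value):
            return False
    return True
```

Sample observations should include edge cases such as zero distances and extreme velocities, since those are where divisions and logarithms tend to break.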

What makes it unique

Generates reward functions as executable Python code rather than treating them as hyperparameters or learned models. The LLM learns to write code that captures task-relevant objectives by analyzing which reward functions led to better RL agent performance, enabling discovery of novel reward structures that humans might not manually design.

vs alternatives

Eliminates manual reward engineering bottleneck in RL, enabling faster iteration and discovery of non-obvious reward structures. More flexible than inverse RL (which requires demonstrations) and more interpretable than learned reward models, though computationally expensive due to RL training cost per iteration.

multi-step reasoning trajectory generation for complex optimization

Medium confidence

Extends OPRO to handle complex optimization problems by prompting the LLM to generate multi-step reasoning or decomposed solutions rather than single-shot candidates. The LLM learns to break down optimization problems into subproblems, generate intermediate solutions, and compose them into final candidates. This enables optimization of problems with hierarchical or compositional structure, where the LLM's reasoning process itself becomes part of the optimization trajectory.
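If the LLM is asked to emit its reasoning in a fixed layout, the trace can be parsed into steps and a final answer so that both can be stored in the trajectory. The `Step N:` / `Final:` layout and the `Trace` type are assumptions of this sketch; the page does not specify a format.

```python
import re
from typing import List, NamedTuple


class Trace(NamedTuple):
    steps: List[str]   # intermediate reasoning steps, in order
    final: str         # the candidate solution to be scored


def parse_trace(text: str) -> Trace:
    """Split a reply into its 'Step N:' lines and its 'Final:' line."""
    steps = re.findall(r"^Step \d+:\s*(.+)$", text, flags=re.MULTILINE)
    match = re.search(r"^Final:\s*(.+)$", text, flags=re.MULTILINE)
    if match is None:
        raise ValueError("no 'Final:' line found in reply")
    return Trace(steps=[step.strip() for step in steps],
                 final=match.group(1).strip())
```

Only `final` needs to be scored, while `steps` can be shown back to the model alongside the score in later iterations.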

Solves for

Optimize complex problems with hierarchical or compositional structure

Leverage LLM reasoning to decompose problems into more tractable subproblems

Generate solutions that require multi-step planning or constraint satisfaction

Improve optimization quality by incorporating LLM's reasoning transparency

Best for

Researchers optimizing complex algorithms or system designs

Teams solving constraint satisfaction or combinatorial optimization problems

Practitioners building planning systems that require interpretable reasoning

Requires

Access to an LLM with strong reasoning capabilities (GPT-4 or equivalent)

Sufficient context window to accommodate multi-step reasoning (8K+ tokens recommended)

Ability to parse and validate multi-step solutions

Limitations

Multi-step reasoning increases prompt length and LLM latency, making optimization slower and more expensive

Reasoning quality is difficult to evaluate; LLM may generate plausible-sounding but incorrect reasoning

Decomposition strategy is problem-specific; no general method to automatically determine optimal decomposition

What makes it unique

Treats the LLM's reasoning process as part of the optimization trajectory, allowing the optimizer to learn not just what solutions are good, but how to reason about generating good solutions. This enables optimization of problems where the reasoning path is as important as the final answer.

vs alternatives

More interpretable and flexible than black-box optimization for complex problems, while leveraging LLM's reasoning capabilities to handle problems that require planning or constraint satisfaction. Slower than single-shot generation but enables optimization of problems that single-shot approaches cannot solve.

Capabilities are decomposed by AI analysis. Each maps to specific user intents and improves with match feedback.

Related Artifacts sharing capabilities

Artifacts that share capabilities with Large Language Models as Optimizers (OPRO), ranked by overlap. Discovered automatically through the match graph.

Repository 23

Agents

Library/framework for building language agents

symbolic-learning-based agent optimization

language-based loss evaluation and gradient generation

prompt-and-tool-parameter optimization

3 shared capabilities

Product 17

Mathematical discoveries from program search with large language models (FunSearch)

iterative program refinement with failure-driven learning

constraint-aware program generation with multi-objective evaluation

2 shared capabilities

Product 17

Retroformer: Retrospective Large Language Agents with Policy Gradient Optimization (Retroformer)

retrospective trajectory optimization via policy gradient learning

trajectory filtering and quality-based curriculum learning

2 shared capabilities

Product 19

Build a Large Language Model (From Scratch)

A guide to building your own working LLM, by Sebastian Raschka.

optimization-algorithm-implementation

1 shared capability

Agent 54

hello-agents

📚 "Building Agents from Scratch" (《从零开始构建智能体》): a from-scratch tutorial on agent principles and practice

agentic reinforcement learning training pipeline for agent optimization

1 shared capability

Product 26

Tutory

AI-driven tutor and teaching assistant for personalized...

adaptive-learning-path-generation

1 shared capability

Best For

✓ Researchers optimizing prompt templates or hyperparameters for LLM tasks
✓ Teams solving discrete optimization problems where gradient-based methods are infeasible
✓ Practitioners needing few-shot optimization without training custom models
✓ AutoML and neural architecture search applications
✓ Prompt engineers optimizing instruction templates for downstream tasks
✓ Hyperparameter tuning for machine learning models
✓ Discrete optimization problems (e.g., combinatorial search, code generation)
✓ Few-shot learning scenarios with limited evaluation budget

Known Limitations

⚠ Optimization quality depends heavily on LLM's ability to recognize patterns in the trajectory history; may plateau on complex multimodal landscapes
⚠ Each optimization step requires at least one full LLM generation call, making it computationally expensive compared to gradient-based methods for large-scale problems
⚠ No theoretical convergence guarantees; performance is empirical and problem-dependent
⚠ Requires sufficient evaluation budget to build meaningful in-context examples; performs poorly with fewer than roughly 5-10 prior solutions
⚠ LLM may generate solutions that are syntactically valid but semantically nonsensical for the target domain
⚠ Trajectory length is bounded by LLM context window; long optimization histories may be truncated or summarized, losing fine-grained signal

Requirements

Access to a capable LLM (GPT-3.5+ or equivalent) via API or local deployment
An objective function that can score candidate solutions (it only needs to be evaluable, not differentiable)
Ability to serialize solutions as text for LLM input
Python 3.7+ for typical implementations
LLM with sufficient context window (4K+ tokens recommended for meaningful trajectory history)
Evaluation function that returns scalar scores (or easily interpretable metrics)
Ability to format solutions and scores as natural language or structured text
Deterministic or low-variance evaluation to avoid noisy feedback

Input / Output

Accepts:
text (problem description, constraints), structured data (optimization trajectory: previous solutions + their scores), code (for hyperparameter or prompt optimization tasks)
text (problem statement, constraints, evaluation criteria), structured data (trajectory: list of [solution, score] pairs), code (for code-based optimization tasks)
text (initial prompt template, task description, evaluation criteria), structured data (trajectory of previous prompts and their scores), code (evaluation function or benchmark)
text (hyperparameter space definition, constraints, model description), structured data (trajectory of previous configurations and their validation scores), code (model training script, evaluation function)
text (task description, environment specification, constraints on reward function), structured data (trajectory of previous reward functions and their RL training results), code (environment simulator, RL training script)
text (problem description, decomposition strategy, reasoning constraints), structured data (trajectory of previous reasoning traces and their scores)

Produces:
text (optimized solution candidate), structured data (optimization trajectory with new solution and score), code (optimized hyperparameters, prompts, or configurations)
text (next candidate solution), structured data (updated trajectory with new solution and score), metrics (optimization progress, convergence diagnostics)
text (optimized prompt template), structured data (optimization trajectory, performance metrics), metrics (task performance improvement, convergence analysis)
text (optimized hyperparameter configuration), structured data (optimization trajectory, performance curves), metrics (best validation score, convergence rate, hyperparameter importance)
code (Python reward function), structured data (optimization trajectory, RL training curves), metrics (best episode return, reward function complexity, convergence diagnostics)
text (multi-step reasoning trace, final solution), structured data (decomposed subproblems, intermediate solutions, optimization trajectory)

UnfragileRank

Adoption: 15% (30% weight)

Quality: 22% (25% weight)

Ecosystem: 15% (15% weight)

Match Graph: 10% (25% weight)

Freshness: 75% (5% weight)

UnfragileRank is computed from adoption signals, documentation quality, ecosystem connectivity, match graph feedback, and freshness. No artifact can pay for a higher rank.
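The page does not state how the five components are combined. Assuming, purely for illustration, that the rank is a plain weighted sum of the component percentages, the arithmetic would look like the sketch below; the function name `weighted_rank` and the aggregation rule itself are assumptions.

```python
from typing import Dict, Tuple

# (score, weight) per component, both in whole percent, as listed above.
COMPONENTS: Dict[str, Tuple[int, int]] = {
    "adoption": (15, 30),
    "quality": (22, 25),
    "ecosystem": (15, 15),
    "match_graph": (10, 25),
    "freshness": (75, 5),
}


def weighted_rank(components: Dict[str, Tuple[int, int]]) -> float:
    """Weighted sum on a 0-100 scale; the weights must add up to 100."""
    total_weight = sum(weight for _, weight in components.values())
    if total_weight != 100:
        raise ValueError(f"weights sum to {total_weight}, expected 100")
    return sum(score * weight for score, weight in components.values()) / 100
```

Under that assumption the listed values combine to (450 + 550 + 225 + 250 + 375) / 100 = 18.5 out of 100. The site's actual formula may differ.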


Alternatives to Large Language Models as Optimizers (OPRO)

IntelliCode (Extension, 50): AI-assisted development

GitHub Copilot Chat (Extension, 53): AI chat features powered by Copilot

GitHub Copilot (Extension, 52): Your AI pair programmer

Claude Code for VS Code (Extension, 52): Harness the power of Claude Code without leaving your IDE


Data Sources

github awesome

