world-model-based reinforcement learning with latent imagination
DreamerV3 learns a compact world model that predicts future states in a learned latent space, then uses this model to plan and train policies through imagination without requiring environment interaction for every gradient step. The architecture uses a variational autoencoder (VAE) to compress observations into a latent representation, a recurrent state-space model to predict latent dynamics, and a decoder to reconstruct observations. Policy and value functions are trained on imagined trajectories generated by rolling out the world model, dramatically reducing sample complexity compared to model-free RL.
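As a rough illustration of these three components (encoder, recurrent latent dynamics, decoder), the sketch below uses Gaussian latents over flattened observations in PyTorch; the class name `LatentWorldModel`, the method names, and the layer sizes are illustrative assumptions, not DreamerV3's actual categorical-latent JAX implementation.

```python
import torch
import torch.nn as nn

class LatentWorldModel(nn.Module):
    """Sketch: encoder compresses observations, a recurrent core predicts latent dynamics,
    and a decoder reconstructs observations from latents."""
    def __init__(self, obs_dim=1024, action_dim=6, latent_dim=32, hidden_dim=256):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(obs_dim, hidden_dim), nn.ELU(),
            nn.Linear(hidden_dim, 2 * latent_dim),        # mean and log-std of the posterior
        )
        self.core = nn.GRUCell(latent_dim + action_dim, hidden_dim)  # recurrent state-space core
        self.prior_head = nn.Linear(hidden_dim, 2 * latent_dim)      # predicts the next latent
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim + hidden_dim, hidden_dim), nn.ELU(),
            nn.Linear(hidden_dim, obs_dim),               # reconstruction target
        )

    def encode(self, obs):
        mean, log_std = self.encoder(obs).chunk(2, dim=-1)
        return mean + torch.randn_like(mean) * log_std.exp()          # reparameterized posterior sample

    def imagine_step(self, latent, action, hidden):
        hidden = self.core(torch.cat([latent, action], dim=-1), hidden)
        mean, log_std = self.prior_head(hidden).chunk(2, dim=-1)
        return mean + torch.randn_like(mean) * log_std.exp(), hidden  # imagined next latent
```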
Unique: DreamerV3 uses a unified latent-space representation for both world modeling and policy learning, with a novel scaling approach (symlog) that handles rewards across 10+ orders of magnitude without task-specific normalization. Unlike prior world-model methods (PlaNet, Dreamer v1/v2), it achieves strong performance on both visual control and Atari with a single architecture and hyperparameter configuration, through improved training stability and a unified loss function that balances reconstruction, dynamics, and policy objectives.
vs alternatives: Outperforms model-free methods (PPO, SAC) on sample efficiency by 10-100x and matches or exceeds model-based alternatives (MBPO, SLAC) while requiring no task-specific reward normalization or domain adaptation, making it more practical for diverse visual domains.
multi-task visual policy learning with task-agnostic world models
DreamerV3 learns a single world model that captures visual dynamics common across multiple tasks, then trains separate task-specific policy heads that leverage the shared latent representation. The world model is trained on a mixture of trajectories from different tasks without explicit task conditioning, discovering task-invariant visual features (object motion, physics) that transfer across diverse objectives. Task-specific policies are trained through imagination using the shared world model, enabling rapid adaptation to new tasks with minimal additional data.
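A minimal sketch of this split (one shared latent representation, one small policy head per task); the class name, task names, and layer sizes below are hypothetical.

```python
import torch
import torch.nn as nn

class TaskSpecificPolicies(nn.Module):
    """One shared latent representation (from the task-agnostic world model), one head per task."""
    def __init__(self, latent_dim, task_action_dims):
        super().__init__()
        self.heads = nn.ModuleDict({
            task: nn.Sequential(nn.Linear(latent_dim, 256), nn.ELU(), nn.Linear(256, act_dim))
            for task, act_dim in task_action_dims.items()
        })

    def forward(self, latent, task):
        # The latent carries task-invariant features; only the head is task-specific.
        return self.heads[task](latent)

# Hypothetical usage: three manipulation tasks sharing one world model's latent space.
policies = TaskSpecificPolicies(latent_dim=256, task_action_dims={"reach": 4, "push": 4, "pick": 4})
action_logits = policies(torch.zeros(1, 256), task="push")
```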
Unique: DreamerV3's task-agnostic world model learns shared visual representations without explicit task conditioning, relying on the policy learning objective to extract task-relevant information from the shared latent space. This contrasts with task-conditioned approaches (e.g., MTRL baselines) that explicitly encode task identity, making DreamerV3 more flexible for discovering emergent task structure.
vs alternatives: Achieves better sample efficiency and generalization than task-conditioned baselines by learning task-invariant visual dynamics, while avoiding the computational overhead of task-specific world models or explicit task embeddings.
grounding large language models in interactive environments with online rl (glam)
DreamerV3 is extended in the GLAM framework to ground large language models (LLMs) in interactive environments through online RL. The approach uses an LLM to generate high-level task descriptions or reward functions, which are then used to train RL agents in simulated or real environments. The agent learns a world model of the environment and uses it to optimize policies that maximize the LLM-specified rewards. This enables LLMs to interact with and learn from environments without explicit programming of reward functions or environment dynamics.
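The general pattern described above can be pictured as replacing the environment's native reward with a scoring function proposed by an LLM; the sketch below is purely illustrative (the wrapper class, `llm_reward_fn`, and the 4-tuple step signature are assumptions, not GLAM's actual interface).

```python
class LLMRewardWrapper:
    """Illustrative only: the agent optimizes a reward computed by an LLM-proposed
    scoring function instead of the environment's built-in reward."""
    def __init__(self, env, llm_reward_fn):
        self.env = env
        self.llm_reward_fn = llm_reward_fn   # callable derived from a natural-language task description

    def reset(self):
        return self.env.reset()

    def step(self, action):
        obs, _, done, info = self.env.step(action)   # discard the native reward
        reward = self.llm_reward_fn(obs, info)       # score the transition per the LLM-specified objective
        return obs, reward, done, info
```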
Unique: GLAM extends DreamerV3 to ground LLMs in interactive environments by using LLM-generated reward functions to train RL agents. The approach enables LLMs to specify complex objectives in natural language and learn from environment feedback through online RL.
vs alternatives: Enables more flexible and natural task specification compared to hand-crafted reward functions, while leveraging DreamerV3's sample efficiency to make LLM-guided RL practical despite the computational overhead of LLM inference.
continuous and discrete action space handling with unified latent planning
DreamerV3 handles both continuous (robotic control) and discrete (Atari games) action spaces through a unified policy parameterization in the learned latent space. The policy network outputs action distributions (Gaussian for continuous, categorical for discrete) that are sampled during imagination rollouts. The world model's dynamics function is agnostic to the action-space type, treating actions simply as inputs to the recurrent state predictor without architectural changes, enabling seamless switching between control modalities.
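A sketch of such a unified action head in PyTorch: one network whose output is interpreted as categorical logits or as Gaussian parameters depending on a flag. The `discrete` flag, layer sizes, and tanh squashing are illustrative simplifications.

```python
import torch
import torch.nn as nn
from torch.distributions import Categorical, Independent, Normal

class ActionHead(nn.Module):
    """Emits a categorical or Gaussian action distribution from the same latent input."""
    def __init__(self, latent_dim, action_dim, discrete: bool):
        super().__init__()
        self.discrete = discrete
        out_dim = action_dim if discrete else 2 * action_dim
        self.net = nn.Sequential(nn.Linear(latent_dim, 256), nn.ELU(), nn.Linear(256, out_dim))

    def forward(self, latent):
        out = self.net(latent)
        if self.discrete:
            return Categorical(logits=out)                    # e.g. Atari button presses
        mean, log_std = out.chunk(2, dim=-1)
        return Independent(Normal(torch.tanh(mean), log_std.exp()), 1)  # e.g. joint torques
```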
Unique: DreamerV3 uses a single latent-space policy architecture that parameterizes both continuous and discrete action distributions without task-specific modifications, treating action space type as a hyperparameter rather than an architectural choice. This contrasts with prior work that required separate policy heads or explicit action space handling.
vs alternatives: Enables unified training across Atari and continuous control benchmarks with identical code, whereas most RL frameworks require separate implementations or significant hyperparameter tuning per domain.
imagination-based policy optimization with latent rollouts
DreamerV3 trains policies by rolling out imagined trajectories in the learned latent space, computing policy gradients without environment interaction. The process involves: (1) sampling initial latent states inferred from replayed experience, (2) rolling out the policy in imagination for H steps, (3) computing returns using the value function, and (4) backpropagating policy gradients through the imagined trajectory. The world model is frozen during policy optimization, enabling efficient amortization of world model computation across multiple policy updates.
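A condensed sketch of this loop, assuming continuous actions (so the policy distribution supports `rsample`) and hypothetical callables `step_fn`, `reward_fn`, and `value_fn` standing in for the world model's transition, reward, and value heads. The real agent additionally uses lambda-returns, symlog targets, and entropy regularization.

```python
import torch

def imagination_policy_loss(step_fn, reward_fn, policy, value_fn,
                            latent, hidden, horizon=15, gamma=0.99):
    """Roll the policy forward in the learned latent dynamics and backprop the imagined return."""
    rewards = []
    for _ in range(horizon):
        action = policy(latent).rsample()                  # differentiable action sample
        latent, hidden = step_fn(latent, action, hidden)   # imagined transition, no environment call
        rewards.append(reward_fn(latent, hidden))
    # Discounted return with a value bootstrap at the imagination horizon.
    ret = value_fn(latent)
    returns = []
    for r in reversed(rewards):
        ret = r + gamma * ret
        returns.insert(0, ret)
    return -torch.stack(returns).mean()                    # maximize imagined returns
```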
Unique: DreamerV3 trains its critic on imagined trajectories with symlog-scaled value targets, regularizing the critic toward a slowly updated copy of its own weights rather than relying on hard target networks, and it needs no separate replay buffer for policy learning. The imagination rollout is differentiable end-to-end, allowing gradients to flow through the world model during policy updates (though the world model is typically frozen).
vs alternatives: Achieves better sample efficiency than model-free RL (PPO, SAC) by training on imagined rollouts, while maintaining stability through careful value function design and avoiding the distribution shift issues that plague naive model-based approaches.
symlog reward scaling for multi-scale reward normalization
DreamerV3 introduces symlog (symmetric logarithm) scaling to handle rewards spanning 10+ orders of magnitude without task-specific normalization. The symlog function compresses large-magnitude values logarithmically while remaining approximately linear near zero, enabling a single value function and reward prediction head to handle both large-magnitude rewards (e.g., game scores in the hundreds or thousands) and small dense rewards (e.g., continuous control rewards in [-1, 1]). This is applied to both reward prediction in the world model and value function targets, eliminating the need for per-task reward normalization.
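The transformation itself is small enough to state directly; a minimal NumPy version of symlog and its inverse:

```python
import numpy as np

def symlog(x):
    """Compress large magnitudes logarithmically while staying roughly linear near zero."""
    return np.sign(x) * np.log1p(np.abs(x))

def symexp(x):
    """Inverse of symlog, mapping predictions back to the original reward scale."""
    return np.sign(x) * np.expm1(np.abs(x))

# A large game score and a small dense reward land on comparable scales:
print(symlog(1000.0))            # ~6.91
print(symlog(0.5))               # ~0.41
print(symexp(symlog(1000.0)))    # ~1000.0
```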
Unique: DreamerV3's symlog scaling is a fixed, invertible, differentiable transformation that handles both sparse and dense rewards without task-specific tuning, contrasted with prior approaches that required manual reward clipping, normalization, or separate value functions per task.
vs alternatives: Eliminates the need for per-task reward normalization (e.g., reward clipping, running mean/std) while maintaining stable value function learning, reducing engineering overhead compared to task-conditioned baselines.
joint world model and policy training with shared latent representation
DreamerV3 trains the world model and policy jointly using a unified loss function that combines reconstruction, dynamics, and policy objectives. The world model learns to compress observations into a latent space that is simultaneously useful for predicting future states and for learning control policies. The policy and value function are trained on imagined rollouts from the world model, creating a feedback loop where policy performance informs which latent features are most useful for control. This joint training is enabled by a shared encoder/decoder architecture and careful balancing of loss weights.
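A sketch of how the world-model side of this combined objective might be assembled, assuming Gaussian posterior and prior distributions; the function name and the scale values are illustrative placeholders, not the official coefficients.

```python
import torch
import torch.nn.functional as F

def joint_world_model_loss(obs, recon, reward, pred_reward, posterior, prior,
                           recon_scale=1.0, reward_scale=1.0, kl_scale=0.5):
    """One combined objective over reconstruction, reward prediction, and latent dynamics."""
    recon_loss = F.mse_loss(recon, obs)                # decode latents back to observations
    reward_loss = F.mse_loss(pred_reward, reward)      # predict (symlog-scaled) rewards
    kl_loss = torch.distributions.kl_divergence(posterior, prior).mean()  # keep posterior near the prior
    return recon_scale * recon_loss + reward_scale * reward_loss + kl_scale * kl_loss
```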
Unique: DreamerV3 uses a unified loss function that jointly optimizes reconstruction, dynamics, and policy objectives with learnable loss weights, enabling the policy to guide world model learning. This contrasts with prior approaches (PlaNet, Dreamer v1/v2) that trained world models and policies sequentially or with fixed loss weight ratios.
vs alternatives: Achieves better sample efficiency than sequential training by having the policy guide world model learning toward control-relevant features, while maintaining stability through careful loss balancing and shared representation learning.
visual observation encoding with vae-based latent compression
DreamerV3 uses a variational autoencoder (VAE) to compress high-dimensional visual observations (e.g., 64x64 RGB images) into a compact latent representation (typically 32-256 dimensions). The encoder network maps observations to a Gaussian distribution in latent space, while the decoder reconstructs observations from latent samples. The VAE is trained with a reconstruction loss (L2 or L1) and a KL divergence regularizer that encourages the latent distribution to match a standard normal prior. This compression enables efficient world model learning and policy optimization in the latent space.
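A compact sketch of such an encoder/decoder pair for 64x64 RGB frames with a Gaussian latent; the layer sizes follow a common convolutional layout but are illustrative, and the real model trains the decoder jointly with the dynamics rather than as a standalone VAE.

```python
import torch
import torch.nn as nn

class ConvVAE(nn.Module):
    """Minimal convolutional VAE for 64x64 RGB observations."""
    def __init__(self, latent_dim=32):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2), nn.ELU(),     # 64 -> 31
            nn.Conv2d(32, 64, 4, stride=2), nn.ELU(),    # 31 -> 14
            nn.Conv2d(64, 128, 4, stride=2), nn.ELU(),   # 14 -> 6
            nn.Conv2d(128, 256, 4, stride=2), nn.ELU(),  # 6 -> 2
            nn.Flatten(),
        )
        self.to_stats = nn.Linear(256 * 2 * 2, 2 * latent_dim)   # mean and log-variance
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 1024), nn.Unflatten(1, (1024, 1, 1)),
            nn.ConvTranspose2d(1024, 128, 5, stride=2), nn.ELU(),  # 1 -> 5
            nn.ConvTranspose2d(128, 64, 5, stride=2), nn.ELU(),    # 5 -> 13
            nn.ConvTranspose2d(64, 32, 6, stride=2), nn.ELU(),     # 13 -> 30
            nn.ConvTranspose2d(32, 3, 6, stride=2),                # 30 -> 64
        )

    def forward(self, obs):
        mean, log_var = self.to_stats(self.encoder(obs)).chunk(2, dim=-1)
        z = mean + torch.randn_like(mean) * (0.5 * log_var).exp()     # reparameterization trick
        recon = self.decoder(z)
        kl = -0.5 * (1 + log_var - mean.pow(2) - log_var.exp()).sum(-1).mean()
        return recon, kl
```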
Unique: DreamerV3's VAE encoder uses a fixed standard normal prior without learned variance, enabling stable training without posterior collapse. The decoder is trained jointly with the world model dynamics, allowing reconstruction quality to be optimized for dynamics prediction rather than pixel-perfect reconstruction.
vs alternatives: Achieves better sample efficiency than pixel-based RL by compressing observations into a latent space, while maintaining reconstruction quality through joint training with the world model. Simpler than disentanglement-focused VAE variants (β-VAE, Factor-VAE) while still learning useful visual representations.