Capability
REINFORCE Leave-One-Out (RLOO) Policy Gradient Training
4 artifacts provide this capability.
via “reinforce leave-one-out (rloo) for policy gradient optimization”
Reinforcement learning from human feedback — SFT, DPO, PPO trainers for LLM alignment.
Unique: Implements leave-one-out variance reduction with efficient batch computation: each sampled completion is scored against the mean reward of the other completions drawn for the same prompt (see the sketch after this entry). This reduces gradient variance by 30-50% compared to standard REINFORCE while avoiding the overhead of training a value function, enabling simpler RL training without a critic network.
vs others: Simpler than PPO because it eliminates value-function training and clipping logic; PPO requires a separate critic network and advantage estimation, which makes RLOO a better fit for simple reward functions.
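To make the leave-one-out computation concrete, here is a minimal sketch of an RLOO-style advantage and REINFORCE loss. It is an illustrative implementation under assumed tensor layouts, not any particular library's API; the function name `rloo_policy_gradient_loss` and both argument shapes are assumptions for the example.

```python
import torch

def rloo_policy_gradient_loss(rewards: torch.Tensor,
                              seq_logprobs: torch.Tensor) -> torch.Tensor:
    """REINFORCE loss with a leave-one-out baseline (RLOO sketch).

    rewards:      (num_prompts, k) scalar reward per sampled completion,
                  with k completions sampled per prompt.
    seq_logprobs: (num_prompts, k) summed token log-probs of each
                  completion under the current policy.
    """
    k = rewards.shape[1]
    # Leave-one-out baseline, A_i = r_i - (1/(k-1)) * sum_{j != i} r_j,
    # computed for the whole batch in one vectorized step:
    # baseline_i = (sum_j r_j - r_i) / (k - 1)
    baseline = (rewards.sum(dim=1, keepdim=True) - rewards) / (k - 1)
    advantages = (rewards - baseline).detach()  # no gradient through rewards
    # Standard REINFORCE objective: maximize advantage-weighted log-prob.
    return -(advantages * seq_logprobs).mean()
```

The baseline comes from subtracting each sample's own reward from the per-prompt sum, so no per-sample loop and no learned critic are needed; the other k - 1 samples for the same prompt serve as the baseline.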