Reinforce Leave One Out Rloo For Policy Gradient Optimization

1

TRLRepository55/100

via “reinforce leave-one-out (rloo) for policy gradient optimization”

Reinforcement learning from human feedback — SFT, DPO, PPO trainers for LLM alignment.

Unique: Implements leave-one-out variance reduction with efficient batch computation, reducing gradient variance by 30-50% compared to standard REINFORCE while avoiding value function training overhead, enabling simpler RL training without critic networks

vs others: Simpler than PPO because it eliminates value function training and clipping logic, whereas PPO requires separate critic network and advantage estimation, making RLOO more suitable for simple reward functions

2

AReaLAgent45/100

via “configurable-rl-algorithm-implementation-with-ppo-and-grpo-variants”

The RL Bridge for LLM-based Agent Applications. Made Simple & Flexible.

Unique: Decouples reference model and critic network management from the main training loop, enabling efficient computation of KL penalties and advantage estimates without duplicating model weights in GPU memory. Asynchronous training orchestration allows rollout workers to continue collecting trajectories while the trainer processes previous batches, reducing idle time.

vs others: More flexible than TRL's PPO implementation because it supports multiple algorithm variants and explicit reference model management; more specialized than general RL frameworks like RLlib because it's optimized specifically for language model training with agentic workflows.

3

trlFramework28/100

via “generative-reward-optimization-grpo-training”

Train transformer language models with reinforcement learning.

Unique: Implements unified reward+policy training where the model generates both outputs and rewards in a single forward pass, reducing pipeline complexity compared to RLHF while maintaining explicit reward signals through a learned reward head

vs others: More integrated than RLHF because it eliminates separate reward model training, while more explicit than DPO because it maintains interpretable reward scores that can be inspected and debugged

4

Training language models to follow human instructions with human feedback (InstructGPT)Product22/100

via “proximal policy optimization (ppo) for language model policy optimization”

* ⭐ 03/2022: [Multitask Prompted Training Enables Zero-Shot Task Generalization (T0)](https://arxiv.org/abs/2110.08207)

Unique: Applies PPO with KL regularization to language generation, treating token selection as sequential decisions and using a learned reward model as the optimization signal. The KL penalty against the supervised fine-tuned model prevents reward hacking and maintains general language capabilities while optimizing for human preferences.

vs others: More stable and sample-efficient than vanilla policy gradient methods, and the KL regularization prevents the model from diverging too far from human-like language patterns while still optimizing for preferences, unlike unconstrained RL which can lead to reward hacking.

5

Retroformer: Retrospective Large Language Agents with Policy Gradient Optimization (Retroformer)Product19/100

via “retrospective trajectory optimization via policy gradient learning”

### Other Papers <a name="2023op"></a>

Unique: Applies policy gradient optimization directly to language model action logits using retrospective trajectory data, enabling agents to learn from their own execution history without external reward models or human feedback — a departure from supervised fine-tuning or RLHF approaches that require explicit human preferences

vs others: More sample-efficient than online RL methods because it reuses trajectories already generated during agent deployment, and more scalable than RLHF because it avoids human annotation bottlenecks by learning from task outcomes directly

Top Matches

Also Known As

Company