Group Relative Policy Optimization (GRPO) with vLLM Generation Backend

1

TRL | Repository | 55/100

via “group relative policy optimization (grpo) with vllm generation backend”

Reinforcement learning from human feedback — SFT, DPO, PPO trainers for LLM alignment.

Unique: Dual-mode vLLM integration (server vs colocate) with automatic memory management and weight synchronization, enabling efficient scaling from single-GPU to multi-GPU setups without code changes; built-in reward function composition for combining multiple signals
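Reward function composition can be pictured as a weighted sum over several scoring functions. The sketch below is a minimal, library-free illustration under that assumption; `length_reward`, `format_reward`, and `combine_rewards` are hypothetical names made up for this example and are not the trainer's API.

```python
from typing import Callable, Sequence

# A reward function maps (prompt, completion) to a scalar score.
RewardFn = Callable[[str, str], float]


def length_reward(prompt: str, completion: str) -> float:
    """Hypothetical signal: grows with word count, capped at 20 words."""
    return min(len(completion.split()), 20) / 20.0


def format_reward(prompt: str, completion: str) -> float:
    """Hypothetical signal: 1.0 if the completion ends with a period."""
    return 1.0 if completion.strip().endswith(".") else 0.0


def combine_rewards(reward_fns: Sequence[RewardFn],
                    weights: Sequence[float]) -> RewardFn:
    """Return one reward function that is the weighted sum of the inputs."""
    if len(reward_fns) != len(weights):
        raise ValueError("need exactly one weight per reward function")

    def combined(prompt: str, completion: str) -> float:
        return sum(w * fn(prompt, completion)
                   for fn, w in zip(reward_fns, weights))

    return combined


reward = combine_rewards([length_reward, format_reward], [0.5, 1.0])
score = reward("Describe GRPO.",
               "GRPO compares sampled completions within a group.")
```

Keeping each signal as its own function makes it possible to inspect or reweight one signal without touching the others.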

vs others: Faster than PPO for online RL because GRPO trains no value head, estimating advantages from the rewards within each group of sampled completions instead; more flexible than DPO because it supports arbitrary reward functions and online data collection; more scalable than naive RL implementations through vLLM's optimized generation
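The "group relative" part of GRPO refers to scoring each completion against the other completions sampled for the same prompt, which is what lets it skip a value head. A minimal sketch of that normalization follows; it assumes the population standard deviation and a small `eps` for stability, and `group_relative_advantages` is a hypothetical helper, not a library function.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def group_relative_advantages(rewards: Sequence[float],
                              eps: float = 1e-4) -> List[float]:
    """Normalize rewards within one group of completions for a single prompt.

    Each advantage is (reward - group mean) / (group std + eps), so the
    group itself serves as the baseline instead of a learned value estimate.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std over the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four completions sampled for the same prompt, with their scalar rewards.
advantages = group_relative_advantages([1.0, 0.0, 0.5, 0.5])
```

Completions that score above the group mean get a positive advantage and those below get a negative one; a group in which every completion scores the same yields all-zero advantages.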

2

trl | Framework | 28/100

via “generative-reward-optimization-grpo-training”

Train transformer language models with reinforcement learning.

Unique: Implements unified reward+policy training where the model generates both outputs and rewards in a single forward pass, reducing pipeline complexity compared to RLHF while maintaining explicit reward signals through a learned reward head

vs others: More integrated than RLHF because it eliminates separate reward model training, while more explicit than DPO because it maintains interpretable reward scores that can be inspected and debugged

3

GitHub | Repository | 25/100

via “reinforcement learning optimization with grpo for ocr quality”


Unique: Uses GRPO (Group Relative Policy Optimization) rather than standard PPO, reducing variance in reward signals and improving training stability. Integrates directly with the benchmarking framework to generate rewards, creating a tight feedback loop between evaluation and optimization.

vs others: More sample-efficient than standard PPO because GRPO uses group-relative rewards; more aligned with OCR metrics than generic RL because rewards are directly derived from benchmarking scores.
