Capability
Direct Preference Optimization (DPO) Training with RLHF Integration
20 artifacts provide this capability.
Top Matches
via “trl (transformer reinforcement learning) fine-tuning compatibility”
Text-generation model. 7,106,872 downloads.
Unique: Explicitly designed as a minimal test harness for the TRL library. It uses the standard Qwen2 architecture with no custom RL-specific modifications, so TRL training scripts can run without model-specific adaptations.
vs others: Faster training iteration than full-size models, but with limited transfer to production; compatible with the TRL ecosystem, but requires external reward models and preference data.
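For context on what the DPO capability computes from preference data, here is a minimal sketch of the per-pair DPO loss in plain Python. It is not TRL's API: the function name `dpo_loss`, its argument names, and the default `beta=0.1` are illustrative assumptions, and the inputs are taken to be summed log-probabilities of the chosen and rejected responses under the policy and a frozen reference model.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-pair DPO loss: -log(sigmoid(beta * (policy margin - reference margin))).

    Hypothetical helper for illustration; names and the beta default are assumptions.
    """
    # How much more the policy prefers the chosen response over the rejected one.
    policy_margin = policy_chosen_logp - policy_rejected_logp
    # The same preference margin under the frozen reference model.
    ref_margin = ref_chosen_logp - ref_rejected_logp
    logits = beta * (policy_margin - ref_margin)
    # -log(sigmoid(x)) equals softplus(-x); this form avoids overflow for large |x|.
    return max(-logits, 0.0) + math.log1p(math.exp(-abs(logits)))


# When policy and reference agree exactly, the loss is log(2).
baseline = dpo_loss(-2.0, -2.0, -2.0, -2.0)
# When the policy favors the chosen response more than the reference does, the loss drops.
improved = dpo_loss(-1.0, -3.0, -2.0, -2.0)
```

In a real training run the log-probabilities would come from forward passes of the model being tuned and of a reference copy, which is where a small test model like the one listed above keeps iteration fast.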