Capability
Parameter-Efficient Reasoning Through RL Scaling
20 artifacts provide this capability.
Top Matches
via “chain-of-thought reasoning with reinforcement learning optimization”
text-generation model. 4,025,647 downloads.
Unique: Uses RL-based training to learn how many reasoning tokens to allocate to each problem, so reasoning depth adapts to the problem rather than staying fixed. Reasoning quality is optimized explicitly through reward signals instead of emerging implicitly from instruction tuning.
vs others: Reported to outperform GPT-4 and Claude on the AIME and MATH benchmarks by learning to allocate reasoning compute efficiently, while remaining open source and deployable locally with no API dependency.
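Example: the listing does not say how the reward is defined, so the following is only a minimal sketch of one way a reward signal could encourage adaptive reasoning depth. The function name `reasoning_reward`, the per-token cost, and the token cap are all hypothetical and do not come from the model's documentation.

```python
def reasoning_reward(
    is_correct: bool,
    reasoning_tokens: int,
    token_cost: float = 0.0002,  # hypothetical penalty per reasoning token
    max_tokens: int = 2048,      # hypothetical cap on penalized tokens
) -> float:
    """Illustrative reward: credit for a correct answer minus a small length cost.

    With these assumed defaults the maximum penalty is about 0.41, so a correct
    answer always scores above an incorrect one, and among correct answers the
    shorter reasoning trace scores higher.
    """
    penalized = min(max(reasoning_tokens, 0), max_tokens)
    correctness = 1.0 if is_correct else 0.0
    return correctness - token_cost * penalized


if __name__ == "__main__":
    # Compare a short and a long correct trace against an incorrect one.
    print(reasoning_reward(True, 200))
    print(reasoning_reward(True, 1500))
    print(reasoning_reward(False, 200))
```

Under a reward shaped like this, a policy trained with RL would be pushed to spend few tokens on easy problems and more only when the extra reasoning changes the answer, which is one reading of "dynamic reasoning token allocation".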