ExGRPO-Qwen2.5-Math-7B-Zero: Experience-Driven Mathematical Reasoning
The ExGRPO-Qwen2.5-Math-7B-Zero model, developed by rzzhan, is a 7.6-billion-parameter language model built on the Qwen2.5-Math-7B architecture. It implements the ExGRPO (Experiential Group Relative Policy Optimization) framework, which addresses a core inefficiency of standard on-policy optimization in reinforcement learning from verifiable rewards (RLVR) for reasoning tasks: rollout experiences are normally discarded after a single update.
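At each training step, ExGRPO mixes fresh on-policy rollouts with replayed past experiences instead of throwing every rollout away after one gradient update. The snippet below is a minimal, self-contained sketch of that batch-mixing idea only; the function name, the dict-based rollout format, and the 50/50 mix ratio are illustrative assumptions, not the released training code.

```python
import random

# Toy sketch of ExGRPO-style batch mixing: each update combines fresh
# on-policy rollouts with experiences replayed from earlier steps.
# All names and the default mix ratio are illustrative assumptions.

def build_update_batch(fresh_rollouts, replay_buffer, mix_ratio=0.5):
    """Compose a training batch from fresh and replayed rollouts."""
    n_replay = int(len(fresh_rollouts) * mix_ratio)
    replayed = random.sample(replay_buffer, min(n_replay, len(replay_buffer)))
    return fresh_rollouts + replayed

# Demo with placeholder dicts standing in for reasoning trajectories.
buffer = [{"id": f"old-{i}", "correct": True} for i in range(10)]
fresh = [{"id": f"new-{i}", "correct": False} for i in range(8)]
batch = build_update_batch(fresh, buffer)
print(f"{len(batch)} rollouts in the update batch")
```

The point of the mix is that the policy keeps exploring while still extracting additional learning signal from rollouts it has already paid for.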
Key Capabilities & Innovations
- Strategic Experience Management: ExGRPO introduces a system to identify, manage, and replay "high-value" experiences during RLVR training, using online proxy metrics such as rollout correctness and trajectory entropy to quantify experience quality (see the scoring sketch after this list).
- Enhanced Training Efficiency: By prioritizing and replaying valuable past explorations, the framework makes RLVR training more efficient and stable, mitigating failure modes such as training collapse in challenging setups.
- Broad Applicability: The ExGRPO framework generalizes across backbone models, including Llama-3.1 and other Qwen2.5 variants, for mathematical reasoning.
- Mathematical Reasoning Focus: This specific model is trained with ExGRPO directly from the Qwen2.5-Math-7B base (the "-Zero" setting, without a supervised fine-tuning stage) and specializes in mathematical problem solving.
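The two proxy metrics named in the first bullet can be made concrete with a small toy. In the sketch below, a question is scored highest when its rollout accuracy sits near 50% (neither trivial nor hopeless), and a trajectory's confidence is approximated by the mean negative log-probability of its sampled tokens (lower means more confident). The exact scoring and weighting ExGRPO uses are not reproduced here, so treat both functions as illustrative assumptions.

```python
import math

# Toy proxies for experience value: per-question rollout correctness
# and per-trajectory (log-prob-based) entropy. Illustrative only.

def question_value(n_correct: int, n_rollouts: int) -> float:
    """1.0 at 50% rollout accuracy, falling to 0.0 at 0% or 100%."""
    acc = n_correct / n_rollouts
    return 1.0 - 2.0 * abs(acc - 0.5)

def trajectory_entropy(token_probs: list[float]) -> float:
    """Mean negative log-probability of the sampled tokens; lower
    values mean the policy was more confident along the trajectory."""
    return -sum(math.log(p) for p in token_probs) / len(token_probs)

# Two stored experiences with the same accuracy but different confidence:
# under a low-entropy preference, A would be replayed before B.
experiences = {
    "A": {"acc": (4, 8), "probs": [0.90, 0.85, 0.95]},
    "B": {"acc": (4, 8), "probs": [0.40, 0.30, 0.50]},
}
for name, e in experiences.items():
    print(name, f"value={question_value(*e['acc']):.2f}",
          f"entropy={trajectory_entropy(e['probs']):.2f}")
```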
Good For
- Mathematical Reasoning Tasks: Excels at solving competition-level math problems across benchmarks such as AIME, AMC, MATH-500, Minerva Math, and OlympiadBench (a minimal inference example follows this list).
- Research in RLVR Optimization: Provides a concrete framework for exploring experience replay and experience-management techniques in reinforcement learning for language models.
- Developing Stable Reasoning Agents: Offers a recipe for improving the stability and efficiency of training models on complex, multi-step reasoning tasks.
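For completeness, here is a minimal inference sketch using Hugging Face Transformers, assuming the Hub repo id follows the model name and that transformers and accelerate are installed. The prompt wording and generation settings are assumptions; a "-Zero"-style model may respond best to a specific prompt template, so consult the model card's own examples where available.

```python
# Minimal inference sketch; repo id, prompt, and generation settings
# are assumptions, not settings taken from the official model card.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "rzzhan/ExGRPO-Qwen2.5-Math-7B-Zero"  # assumed Hub repo id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

prompt = "Solve step by step: what is the sum of the first 50 positive integers?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```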