GRPO-LEAD: Efficient Mathematical Reasoning
PlanePaper/LEAD-7B is a 7.6-billion-parameter model developed with the GRPO-LEAD (GRPO with Length-dependent rewards, Explicit penalties, and Advantage reweighting for Difficulty) reinforcement learning pipeline. This fine-tuning approach trains the model for concise, accurate, and efficient reasoning on complex mathematical tasks.
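The "Advantage reweighting for Difficulty" component can be sketched roughly as follows: advantages are group-normalized as in standard GRPO, then scaled so that harder problems (low pass rate within the sampled group) contribute a stronger gradient signal. The linear weight schedule and its bounds (`w_min`, `w_max`) are illustrative assumptions for this sketch, not the exact scheme used in training.

```python
from statistics import mean, pstdev

def grpo_advantages(rewards, pass_rate, w_min=0.5, w_max=2.0):
    """Group-relative advantages (plain GRPO) scaled by a difficulty weight.

    rewards:   per-sample rewards for one group of rollouts on one problem
    pass_rate: fraction of correct rollouts in the group (proxy for difficulty)
    """
    mu, sigma = mean(rewards), pstdev(rewards)
    # Illustrative linear schedule: pass_rate -> 0 (hard) gives w_max,
    # pass_rate -> 1 (easy) gives w_min.
    w = w_min + (w_max - w_min) * (1.0 - pass_rate)
    return [w * (r - mu) / max(sigma, 1e-6) for r in rewards]
```

Note that the reweighting preserves the zero-mean property of group-relative advantages; it only rescales how much each group contributes to the policy update.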
Key Capabilities & Performance
- Superior Mathematical Reasoning: Achieves higher consistency and accuracy on challenging AIME24 and AIME25 datasets.
- Efficiency: Produces significantly shorter average reasoning chains than larger 14B models such as DeepSeek-Distilled-14B and Light-R1-14B-DS, solving problems with fewer tokens.
- Optimized for Difficulty: Trained on a curated dataset, GRPO-LEAD-SFTData, which includes 12,153 high-quality mathematical reasoning samples with a focus on problems with difficulty > 1.
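The "Length-dependent rewards" and "Explicit penalties" components named in the pipeline can be sketched as a single reward function: correct solutions earn more when they are shorter than the group average, and incorrect solutions receive a fixed negative reward regardless of length. The exponential shape and the `alpha`/`penalty` hyperparameters here are illustrative assumptions, not the exact parameterization used by GRPO-LEAD.

```python
import math

def lead_reward(correct, length, mean_len, std_len, alpha=0.5, penalty=-1.0):
    """Length-dependent reward with an explicit wrong-answer penalty.

    correct:  whether the final answer matched the reference
    length:   token length of this solution
    mean_len, std_len: length statistics over the sampled group
    """
    if not correct:
        return penalty  # explicit penalty, independent of length
    z = (length - mean_len) / max(std_len, 1e-6)  # standardized length
    return math.exp(-alpha * z)  # below-average length -> reward above 1
```

Because only the relative ordering of rewards within a group matters for GRPO-style advantages, this shaping pushes the policy toward shorter correct solutions without rewarding brevity on wrong answers.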
When to Use This Model
- Mathematical Problem Solving: Ideal for applications requiring precise and efficient step-by-step mathematical reasoning.
- Concise Explanations: Suited for scenarios that require not just a correct answer but a streamlined, shorter reasoning path.
- Research in RL for Reasoning: Provides a strong baseline and methodology for further work on reinforcement learning for mathematical reasoning tasks.
For detailed implementation and further exploration, refer to the GitHub Repository and the associated GRPO-LEAD-SFTData dataset.