daviddavidlu/DAPO-with-prompt-augmentation-step2480
daviddavidlu/DAPO-with-prompt-augmentation-step2480 is a 1.5-billion-parameter Qwen2.5-Math model fine-tuned with DAPO and prompt augmentation. Developed by Wenquan Lu, Hai Huang, and Randall Balestriero, it is optimized for mathematical reasoning: prompt augmentation increases rollout diversity and stability during reinforcement learning training, and the model scores above 80 on the MATH500 benchmark.
Model Overview
daviddavidlu/DAPO-with-prompt-augmentation-step2480 is the step-2480 checkpoint of a 1.5-billion-parameter Qwen2.5-Math model, developed by Wenquan Lu, Hai Huang, and Randall Balestriero. It was trained with DAPO (without dynamic sampling) combined with prompt augmentation on the MATH Level-3-to-5 dataset. The core idea is to use prompt augmentation to generate reasoning traces under a variety of templates, which significantly increases rollout diversity and stability during reinforcement learning (RL) training.
Key Capabilities
- Mathematical Reasoning: Primarily designed and optimized for complex mathematical problem-solving.
- Enhanced RL Training: Utilizes prompt augmentation to improve the stability and diversity of reasoning traces during RL training.
- Performance: Achieves a score of over 80 on the challenging MATH500 benchmark, indicating strong capabilities in mathematical tasks.
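The augmentation idea above can be sketched in a few lines: render the same question under several instruction templates so that sampled rollouts differ. This is an illustrative sketch only; the templates below are assumptions, not the authors' actual training set.

```python
# Illustrative sketch of prompt augmentation for RL rollouts.
# The templates are hypothetical examples, not the authors' exact templates.
import random

TEMPLATES = [
    "Problem: {q}\nSolve step by step.",
    "{q}\nShow your reasoning, then state the final answer.",
    "You are a careful mathematician. {q}",
]

def augment(question: str) -> list[str]:
    """Return one rollout prompt per template for a single question."""
    return [t.format(q=question) for t in TEMPLATES]

def sample_rollout_prompt(question: str, rng: random.Random) -> str:
    """Pick a random template for one rollout, diversifying reasoning traces."""
    return rng.choice(TEMPLATES).format(q=question)

prompts = augment("Find the sum of the first 10 positive integers.")
print(len(prompts))  # one prompt per template
```

During RL training, each sampled rollout would use one of these renderings, so the policy sees (and is rewarded on) the same problem phrased in multiple ways.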
Use Cases
This model is ideal for applications requiring robust mathematical reasoning, particularly in scenarios where generating diverse and stable reasoning paths is crucial. It is well-suited for research and development in advanced mathematical AI, educational tools, or any system needing to solve complex math problems.
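As a minimal usage sketch, the checkpoint can be loaded like any causal LM from the Hugging Face Hub. The instruction template below is an assumption for illustration; the model card does not specify the prompt format used in training.

```python
# Minimal usage sketch. Assumes the Hugging Face `transformers` library;
# the prompt template is a hypothetical wrapper, not a confirmed format.
MODEL_ID = "daviddavidlu/DAPO-with-prompt-augmentation-step2480"

def build_prompt(problem: str) -> str:
    """Wrap a math problem in a simple instruction template (an assumption)."""
    return f"Problem: {problem}\nReason step by step and give the final answer.\n"

def solve(problem: str, max_new_tokens: int = 512) -> str:
    """Download the checkpoint and generate a solution (needs network access)."""
    from transformers import AutoModelForCausalLM, AutoTokenizer  # heavy import kept local
    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID)
    inputs = tokenizer(build_prompt(problem), return_tensors="pt")
    outputs = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

print(build_prompt("What is 7 * 8?"))
```

For math benchmarks such as MATH500, answers are typically extracted from the generated reasoning trace, so downstream code should parse the model's final-answer line rather than the full output.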