daviddavidlu/DAPO-with-prompt-augmentation-step2720

Task: Text generation
Concurrency cost: 1
Model size: 1.5B
Quantization: BF16
Context length: 32k
Published: Feb 5, 2026
License: apache-2.0
Architecture: Transformer
Open weights, warm

daviddavidlu/DAPO-with-prompt-augmentation-step2720 is the step-2720 checkpoint of a Qwen2.5-Math-1.5B model developed by Wenquan Lu and his team. It is fine-tuned on the MATH Level-3-to-5 dataset with the PrAg-PO method, which uses prompt augmentation to increase the diversity of reasoning traces and improve stability during reinforcement learning. The model is intended for mathematical reasoning tasks.


Overview

This model is the step-2720 checkpoint of a Qwen2.5-Math-1.5B training run, developed by Wenquan Lu and his team as part of the PrAg-PO (Prompt Augmented Policy Optimization) project. Its distinguishing feature is the training method: prompts are augmented so that the policy produces more varied reasoning traces, which improves rollout diversity and stability during reinforcement learning.
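The idea of prompt augmentation can be illustrated with a small sketch. This is not the PrAg-PO implementation: the templates, the function name `augment_prompts`, and the round-robin assignment are all hypothetical and only show how one problem might be wrapped in several prompt variants before rollouts are sampled.

```python
import random

# Hypothetical templates for illustration only; they are NOT taken from
# the PrAg-PO paper or repository.
TEMPLATES = [
    "Solve the following problem step by step.\n\n{problem}",
    "{problem}\n\nThink carefully and put the final answer in \\boxed{{}}.",
    "Question: {problem}\nAnswer with detailed reasoning:",
]


def augment_prompts(problem, n_rollouts, seed=0):
    """Return n_rollouts prompts for one problem, cycling through templates.

    The template order is shuffled with a seeded RNG so the assignment is
    reproducible, then templates are reused round-robin.
    """
    order = list(TEMPLATES)
    random.Random(seed).shuffle(order)
    return [order[i % len(order)].format(problem=problem) for i in range(n_rollouts)]


if __name__ == "__main__":
    for prompt in augment_prompts("What is 2+2?", n_rollouts=4):
        print(repr(prompt))
```

Each rollout for the same problem then starts from a differently worded prompt, which is one simple way to obtain varied reasoning traces; the actual augmentation scheme used in training is described in the PrAg-PO paper and repository.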

Key Capabilities

  • Mathematical Reasoning: Fine-tuned to solve mathematical problems, using the MATH Level-3-to-5 dataset as training data.
  • Robustness and Diversity: Prompt augmentation during training produces varied reasoning paths, which is intended to make solutions more robust across different problem structures.
  • Reinforcement Learning Integration: Trained with a policy optimization approach that applies augmented prompts to refine its problem-solving strategies.
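A minimal sketch of running the checkpoint on a single problem is shown below. It assumes the weights load through the Hugging Face `transformers` Auto classes and that `torch` is installed; the card does not include loading code. The prompt wording in `build_prompt` and the `\boxed{}` answer convention handled by `extract_boxed` are assumptions as well, not a documented prompt format for this model.

```python
MODEL_ID = "daviddavidlu/DAPO-with-prompt-augmentation-step2720"


def build_prompt(problem):
    # Assumed prompt wording; the card does not specify a prompt format.
    return f"{problem}\nPlease reason step by step, and put your final answer within \\boxed{{}}."


def extract_boxed(text):
    """Return the contents of the last \\boxed{...} in text, or None."""
    marker = "\\boxed{"
    start = text.rfind(marker)
    if start == -1:
        return None
    depth = 1
    chars = []
    for c in text[start + len(marker):]:
        if c == "{":
            depth += 1
        elif c == "}":
            depth -= 1
            if depth == 0:
                return "".join(chars)
        chars.append(c)
    return None  # unbalanced braces


def generate_solution(problem, max_new_tokens=512):
    # Imported lazily so the helpers above work without these packages.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype=torch.bfloat16)
    inputs = tokenizer(build_prompt(problem), return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
    # Keep only the newly generated tokens, not the echoed prompt.
    new_tokens = output_ids[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)


if __name__ == "__main__":
    solution = generate_solution("What is the sum of the first 10 positive integers?")
    print(solution)
    print("Extracted answer:", extract_boxed(solution))
```

Greedy decoding (`do_sample=False`) is used here only to keep the example deterministic; sampling settings are a choice left to the user.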

Good for

  • Researchers and developers working on mathematical reasoning tasks.
  • Applications that need robust and diverse approaches to solving math problems.
  • Studying the effect of prompt augmentation in reinforcement learning for language models.

For more details, refer to the PrAg-PO GitHub repository and the associated research paper, "PrAg-PO: Prompt Augmented Policy Optimization for Robust and Diverse Mathematical Reasoning".