daviddavidlu/DAPO-with-prompt-augmentation-step2480

Text Generation | Concurrency Cost: 1 | Model Size: 1.5B | Quant: BF16 | Ctx Length: 32k | Published: Feb 5, 2026 | License: apache-2.0 | Architecture: Transformer | Open Weights | Warm

daviddavidlu/DAPO-with-prompt-augmentation-step2480 is the step-2480 checkpoint of a Qwen2.5-Math-1.5B model developed by Wenquan Lu and his team. It is fine-tuned with Prompt Augmented Policy Optimization (PrAg-PO), a reinforcement learning method aimed at robust and diverse mathematical reasoning. It scores over 80 on the MATH500 benchmark.


Overview

This model, daviddavidlu/DAPO-with-prompt-augmentation-step2480, is a Qwen2.5-Math-1.5B checkpoint developed by Wenquan Lu and his team. It was saved at step 2480 of a training run that uses PrAg-PO (Prompt Augmented Policy Optimization). The core idea of the method is prompt augmentation: the training prompts are augmented so that the model generates more diverse reasoning traces, which increases rollout diversity and improves stability during reinforcement learning.

Key Capabilities

  • Mathematical Reasoning: Fine-tuned specifically for solving complex mathematical problems.
  • Robustness and Diversity: Uses prompt augmentation during training to make its reasoning traces more robust and more diverse.
  • Benchmark Performance: Scores over 80 on the MATH500 benchmark.
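
To make the benchmark figure concrete, here is a minimal sketch of how a MATH500-style score could be computed. It assumes the score is a percentage of problems answered correctly and that the model writes its final answer inside \boxed{...}, a common convention for math models. Neither assumption is confirmed on this page, and the helper names `extract_boxed` and `accuracy_percent` are hypothetical. Real graders usually also normalize equivalent forms such as 0.5 and \frac{1}{2}, which this sketch does not attempt.

```python
def extract_boxed(text):
    """Return the contents of the last \\boxed{...} in text, or None if absent."""
    marker = "\\boxed{"
    start = text.rfind(marker)
    if start == -1:
        return None
    depth = 1          # we are inside the opening brace of \boxed{
    chars = []
    for ch in text[start + len(marker):]:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                # Matching close brace found: everything collected is the answer.
                return "".join(chars).strip()
        chars.append(ch)
    return None        # braces never balanced


def accuracy_percent(completions, references):
    """Percentage of completions whose boxed answer exactly matches the reference."""
    if len(completions) != len(references):
        raise ValueError("completions and references must have the same length")
    if not completions:
        return 0.0
    correct = sum(
        extract_boxed(completion) == reference.strip()
        for completion, reference in zip(completions, references)
    )
    return 100.0 * correct / len(completions)
```

Exact string matching is the strictest possible grader, so it gives a lower bound on what a normalizing grader would report for the same completions.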

Training Methodology

The training procedure is described in the paper "PrAg-PO: Prompt Augmented Policy Optimization for Robust and Diverse Mathematical Reasoning" (arXiv:2602.03190). The method aims to increase both the variety and the stability of reasoning paths during RL training.
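
As a rough illustration of the prompt augmentation idea, the sketch below wraps one question in a randomly chosen instruction template for each rollout. The templates, the uniform sampling, and the function name `augment_prompts` are all assumptions made for illustration. The paper defines the actual augmentation scheme, which is not reproduced on this page.

```python
import random

# Hypothetical instruction templates; the templates actually used by PrAg-PO
# are not listed in this model card.
TEMPLATES = [
    "{question}\nPlease reason step by step, and put your final answer within \\boxed{{}}.",
    "Solve the following problem. Show your work, then give the final answer in \\boxed{{}}.\n{question}",
    "{question}\nThink carefully, check each step, and write the final answer in \\boxed{{}}.",
]


def augment_prompts(question, num_rollouts, rng=None):
    """Build one prompt per rollout, each using a randomly sampled template."""
    rng = rng or random.Random()
    return [
        rng.choice(TEMPLATES).format(question=question)
        for _ in range(num_rollouts)
    ]
```

Passing a seeded `random.Random` makes the sampled prompts reproducible. In an RL loop, each returned prompt would be given to the policy to produce one rollout for the same underlying question.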

Good For

  • Mathematical Problem Solving: Suited to applications that need step-by-step mathematical reasoning from a 1.5B-parameter model.
  • Research in RL and Prompt Engineering: Provides a concrete PrAg-PO checkpoint for studying how prompt augmentation affects RL training for mathematical reasoning.
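
For trying the checkpoint on a single problem, a minimal sketch using the Hugging Face transformers library might look like the following. It assumes the repository loads as a standard causal language model with its own tokenizer, and that a plain "reason step by step" instruction is an acceptable prompt. The prompt format and the helper names `build_prompt` and `generate_solution` are illustrative assumptions, not documented on this page.

```python
MODEL_ID = "daviddavidlu/DAPO-with-prompt-augmentation-step2480"


def build_prompt(question):
    """Wrap a question in an assumed step-by-step instruction (not from the model card)."""
    return f"{question}\nPlease reason step by step, and put your final answer within \\boxed{{}}."


def generate_solution(question, max_new_tokens=1024):
    """Download the checkpoint and greedily decode a solution for one question."""
    # Imported lazily so build_prompt can be used without torch/transformers installed.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    # BF16 matches the quantization listed in the page metadata.
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype=torch.bfloat16)

    inputs = tokenizer(build_prompt(question), return_tensors="pt")
    with torch.no_grad():
        output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)

    # Drop the prompt tokens so only the newly generated text is returned.
    new_tokens = output_ids[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)
```

Calling `generate_solution("What is the sum of the first 10 positive integers?")` downloads the weights on first use. Greedy decoding is chosen here only to keep the output deterministic.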