daviddavidlu/DAPO-with-prompt-augmentation-step2820

Text Generation | Concurrency Cost: 1 | Model Size: 1.5B | Quant: BF16 | Ctx Length: 32k | Published: Feb 3, 2026 | License: apache-2.0 | Architecture: Transformer | Open Weights

daviddavidlu/DAPO-with-prompt-augmentation-step2820 is a Qwen2.5-Math-1.5B checkpoint, developed by Wenquan Lu et al. and trained for mathematical reasoning. It was trained with Prompt Augmented Policy Optimization (PrAg-PO), which uses prompt augmentation to produce diverse reasoning traces and to improve robustness and stability during reinforcement learning. The model targets MATH Level-3-to-5 problems.


Model Overview

This model, daviddavidlu/DAPO-with-prompt-augmentation-step2820, is a checkpoint from the PrAg-PO training of a Qwen2.5-Math-1.5B base model. Developed by Wenquan Lu et al., it is designed for advanced mathematical reasoning. The training methodology is detailed in the paper "PrAg-PO: Prompt Augmented Policy Optimization for Robust and Diverse Mathematical Reasoning" and aims to improve the model's ability to solve complex math problems.

Key Capabilities

  • Mathematical Reasoning: Specialized in tackling problems from the MATH Level-3-to-5 Dataset.
  • Prompt Augmentation: Trained with Prompt Augmented Policy Optimization (PrAg-PO), which encourages varied reasoning traces.
  • Robustness and Diversity: The prompt augmentation technique increases rollout diversity and stability during reinforcement learning (RL) training, leading to more robust problem-solving.
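
To make the idea of prompt augmentation concrete, here is a minimal sketch of one way rollout prompts could be varied during RL training. It is not the authors' implementation: it assumes that augmentation means sampling an instruction template per rollout, and the template wording, `TEMPLATES`, and `augment_prompts` are all hypothetical names introduced for illustration.

```python
import random

# Hypothetical instruction templates. The actual templates used by PrAg-PO
# are not given in this model card; these only illustrate the idea.
TEMPLATES = [
    "Think step by step, then state the final answer.\n\nProblem: {problem}",
    "Solve the problem below and explain each step.\n\nProblem: {problem}",
    "Work carefully and give a concise solution.\n\nProblem: {problem}",
]


def augment_prompts(problem, num_rollouts, rng):
    """Return one prompt per rollout, each using a randomly sampled template.

    Assumption: varying the instruction wording across rollouts of the same
    problem is what increases rollout diversity.
    """
    return [
        rng.choice(TEMPLATES).format(problem=problem)
        for _ in range(num_rollouts)
    ]


if __name__ == "__main__":
    rng = random.Random(0)  # seeded so the sketch is reproducible
    for prompt in augment_prompts("What is 12 * 7?", 4, rng):
        print(prompt)
        print("---")
```

Each rollout for the same problem may then start from a differently worded prompt, which is one plausible source of the diversity described above.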

When to Use This Model

This model is particularly suitable for:

  • Solving challenging mathematical problems: Specifically those at the MATH Level-3-to-5 difficulty.
  • Research in mathematical reasoning: Ideal for exploring the impact of prompt augmentation and policy optimization in LLMs.
  • Applications requiring robust and diverse reasoning paths: Where a single, fixed reasoning approach might be insufficient.

Other checkpoints, such as step 2720 and step 2480, may perform better and are worth evaluating alongside this one.
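
For readers who want to try the checkpoint locally, the following is a minimal sketch using the Hugging Face `transformers` library. It assumes the repository loads with the standard `AutoTokenizer` and `AutoModelForCausalLM` classes and that `torch` is installed. The instruction wording in `build_prompt` is a hypothetical choice, not the prompt format used in training, and `build_prompt` and `solve` are illustrative helper names.

```python
MODEL_ID = "daviddavidlu/DAPO-with-prompt-augmentation-step2820"


def build_prompt(problem):
    """Wrap a math problem in a simple instruction.

    Assumption: this wording is illustrative; the card does not specify
    the prompt format used during training.
    """
    return (
        f"{problem.strip()}\n"
        "Please reason step by step, and put your final answer within \\boxed{}."
    )


def solve(problem, max_new_tokens=1024):
    """Load the checkpoint and greedily decode a solution for one problem."""
    # Imported lazily so build_prompt can be used without these packages.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    # BF16 matches the quantization listed in the card metadata.
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID, torch_dtype=torch.bfloat16
    )

    inputs = tokenizer(build_prompt(problem), return_tensors="pt")
    with torch.no_grad():
        output_ids = model.generate(
            **inputs, max_new_tokens=max_new_tokens, do_sample=False
        )

    # Keep only the newly generated tokens, dropping the echoed prompt.
    new_tokens = output_ids[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)


if __name__ == "__main__":
    print(solve("What is the sum of the first 10 positive integers?"))
```

Greedy decoding is used here for reproducibility; sampling with a temperature may be a better fit when diverse reasoning paths are the goal.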