daviddavidlu/PrAg-PO-Qwen3-1.7b-step720
The daviddavidlu/PrAg-PO-Qwen3-1.7b-step720 is a 1.7 billion parameter Qwen3-based language model, specifically a checkpoint from training on the MATH Level-3-to-5 Dataset. Developed by daviddavidlu, this model utilizes Prompt Augmented Policy Optimization (PrAg-PO) to enhance mathematical reasoning. It is optimized for robust and diverse mathematical problem-solving by generating reasoning traces under varied templates.
Loading preview...
Model Overview
The daviddavidlu/PrAg-PO-Qwen3-1.7b-step720 is a 1.7 billion parameter model based on the Qwen3 architecture, representing a specific checkpoint (step 720) from its training process. This model is a product of research detailed in the paper "PrAg-PO: Prompt Augmented Policy Optimization for Robust and Diverse Mathematical Reasoning" by Wenquan Lu, Hai Huang, Enqi Liu, and Randall Balestriero.
Key Capabilities
- Mathematical Reasoning: The model is specifically trained and intended for advanced mathematical reasoning tasks, particularly on problems from the MATH Level-3-to-5 Dataset.
- Prompt Augmented Policy Optimization (PrAg-PO): It incorporates a novel training procedure that leverages prompt augmentation to generate diverse reasoning traces. This technique aims to increase rollout diversity and stability during reinforcement learning (RL) training.
- Robust Problem Solving: By using diverse templates for reasoning traces, the model is designed to achieve more robust and varied approaches to solving mathematical problems.
Good For
- Research in Mathematical AI: Ideal for researchers exploring advanced techniques in mathematical reasoning and reinforcement learning for language models.
- Benchmarking Mathematical Performance: Can be used to evaluate and compare the performance of LLMs on complex mathematical datasets like MATH Level-3-to-5.
- Developing Robust Reasoning Systems: Suitable for applications requiring models that can generate diverse and stable reasoning paths for problem-solving.