michaelbzhu/Qwen2.5-Math-1.5B-GSM8K-GRPO
The michaelbzhu/Qwen2.5-Math-1.5B-GSM8K-GRPO model is a 1.5 billion parameter language model based on the Qwen2.5 architecture, fine-tuned with GRPO (Group Relative Policy Optimization) on the GSM8K mathematical reasoning dataset. It is optimized for solving grade school math problems, using a REINFORCE loss with a baseline and per-sample length normalization for loss aggregation. It is designed to generate a structured reasoning process followed by a final answer, making it suitable for tasks that require step-by-step problem-solving.
Model Overview
michaelbzhu/Qwen2.5-Math-1.5B-GSM8K-GRPO is a 1.5 billion parameter model built on the Qwen2.5-Math-1.5B base. It has been further fine-tuned with Group Relative Policy Optimization (GRPO) on the GSM8K dataset, targeting mathematical reasoning tasks. The fine-tuning uses a REINFORCE loss with a baseline and aggregates the loss with per-sample length normalization, so each response's token-level loss is averaged over that response's own length.
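The loss described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the author's training code: the function names are hypothetical, the baseline is assumed to be the mean reward of the rollout group, and whether the actual run also divides by the group's reward standard deviation is not stated in the model card.

```python
from statistics import mean

def group_baseline_advantages(rewards):
    """Subtract the group mean reward (the assumed baseline) from each reward."""
    baseline = mean(rewards)
    return [r - baseline for r in rewards]

def reinforce_loss(token_logprobs, advantages):
    """REINFORCE-with-baseline loss with per-sample length normalization.

    token_logprobs: one list of per-token log-probabilities per sampled response.
    advantages: one scalar advantage per response.
    """
    per_sample = []
    for logps, adv in zip(token_logprobs, advantages):
        # Average over this response's own tokens so long responses
        # do not dominate the aggregate loss.
        per_sample.append(-adv * sum(logps) / len(logps))
    # Then average over the responses in the group.
    return mean(per_sample)

# Toy group of four rollouts for one prompt (made-up numbers).
rewards = [1.0, 0.0, 0.0, 1.0]
token_logprobs = [[-0.5, -0.5], [-1.0, -1.0, -1.0, -1.0], [-2.0], [-0.25, -0.75]]

advantages = group_baseline_advantages(rewards)
loss = reinforce_loss(token_logprobs, advantages)
```

With per-sample normalization, the four-token response and the one-token response each contribute a single averaged term, which is the design choice the model card refers to.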
Key Capabilities
- Mathematical Reasoning: Optimized for solving grade school math problems, as evidenced by its training on the GSM8K dataset.
- Structured Output: Designed to produce responses with a distinct thought process (<think>...</think>) and a final answer (<answer>...</answer>), facilitating clear and verifiable solutions.
- Reinforcement Learning: Leverages GRPO for enhanced performance in generating correct and well-reasoned mathematical solutions.
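A response in this tag format can be split into its reasoning and answer parts with a regular expression. The sketch below is an assumption about how one might consume the output; the function name `parse_response` is hypothetical, and the exact format check used during training and evaluation is not given in the model card.

```python
import re

# Matches a <think> block followed by an <answer> block; DOTALL lets the
# reasoning span multiple lines.
TAG_PATTERN = re.compile(
    r"<think>(.*?)</think>\s*<answer>(.*?)</answer>", re.DOTALL
)

def parse_response(text):
    """Return the reasoning and answer, or None if the tags are missing."""
    match = TAG_PATTERN.search(text)
    if match is None:
        return None
    return {"think": match.group(1).strip(), "answer": match.group(2).strip()}

parsed = parse_response("<think>2 + 3 = 5</think> <answer>5</answer>")
unparsed = parse_response("The answer is 5")
```

Returning `None` for a malformed response keeps format failures separate from wrong answers.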
Performance
On the GSM8K test set, the model demonstrates:
- Correct Format: 1172 out of 1319 responses adhered to the specified <think> and <answer> tag format.
- Correct Reward: 966 out of 1319 responses received a correct reward, indicating successful problem-solving.
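For easier comparison with other reported results, the raw counts above can be converted to rates. The variable names are illustrative only.

```python
total = 1319            # GSM8K test set size reported above
format_ok = 1172        # responses in the expected tag format
reward_ok = 966         # responses that received a correct reward

format_rate = format_ok / total
correct_rate = reward_ok / total

print(f"format adherence: {format_rate:.1%}")   # 88.9%
print(f"correct reward:   {correct_rate:.1%}")  # 73.2%
```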
Training Details
The GRPO fine-tuning used a learning rate of 3e-5, 100 GRPO steps, and a rollout batch size of 256, with a linear learning rate scheduler and the AdamW optimizer. The model's prompt template guides it to first think through the reasoning process and then provide the answer in a structured format.
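As a rough illustration of what a linear schedule does to the learning rate over the 100 steps, consider the sketch below. It assumes no warmup and a decay to zero at the final step; the model card states neither, and `linear_lr` is a hypothetical helper rather than part of any library.

```python
def linear_lr(step, base_lr=3e-5, total_steps=100):
    """Linearly decay the learning rate from base_lr to 0 (assumed: no warmup)."""
    remaining = max(0, total_steps - step)
    return base_lr * (remaining / total_steps)

# Learning rate at the start, midpoint, and end of training.
schedule = [linear_lr(s) for s in (0, 50, 100)]
```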