michaelbzhu/Qwen2.5-Math-1.5B-GSM8K-GRPO

Text Generation

  • Concurrency Cost: 1
  • Model Size: 1.5B
  • Quant: BF16
  • Ctx Length: 32k
  • Tool Calling: Supported
  • Published: Sep 29, 2025
  • License: mit
  • Architecture: Transformer
  • Open Weights
  • Cold

The michaelbzhu/Qwen2.5-Math-1.5B-GSM8K-GRPO model is a 1.5 billion parameter language model based on the Qwen2.5 architecture, fine-tuned with GRPO (Group Relative Policy Optimization) on the GSM8K mathematical reasoning dataset. It is optimized for solving grade school math problems, and its training uses a REINFORCE loss with a baseline and per-sample length normalization for loss aggregation. The model generates a structured reasoning process followed by a final answer, which makes it suitable for tasks that require step-by-step problem-solving.


Model Overview

The michaelbzhu/Qwen2.5-Math-1.5B-GSM8K-GRPO model is a 1.5 billion parameter model built on the Qwen2.5-Math-1.5B base. It has been further fine-tuned with Group Relative Policy Optimization (GRPO) on the GSM8K dataset, targeting mathematical reasoning tasks. The fine-tuning uses a REINFORCE loss with a baseline and aggregates the loss with per-sample length normalization, meaning each response's token-level loss is averaged over that response's own length.
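The loss described above can be illustrated with a minimal sketch. It is not the author's training code: the function names are hypothetical, the baseline is assumed to be the mean reward of a group of rollouts for the same prompt, and details such as standard-deviation scaling or clipping are left out because the model card does not say whether they were used.

```python
from typing import List


def group_baseline_advantages(rewards: List[float]) -> List[float]:
    """Subtract the group's mean reward (the assumed baseline) from each reward."""
    baseline = sum(rewards) / len(rewards)
    return [r - baseline for r in rewards]


def reinforce_loss(token_logprobs: List[List[float]], rewards: List[float]) -> float:
    """REINFORCE-with-baseline loss for one group of rollouts of the same prompt.

    token_logprobs[i] holds the log-probabilities of the response tokens
    of rollout i; rewards[i] is that rollout's scalar reward.
    """
    advantages = group_baseline_advantages(rewards)
    per_sample = []
    for logps, adv in zip(token_logprobs, advantages):
        # Per-sample length normalization: average over this rollout's own tokens.
        mean_logp = sum(logps) / len(logps)
        per_sample.append(-adv * mean_logp)
    # Aggregate by averaging the per-rollout losses.
    return sum(per_sample) / len(per_sample)


if __name__ == "__main__":
    logps = [[-1.0, -1.0], [-2.0], [-0.5, -0.5, -0.5, -0.5], [-1.0]]
    print(reinforce_loss(logps, rewards=[1.0, 0.0, 0.0, 1.0]))
```

Because every rollout is divided by its own length, a long response and a short response contribute equally to the aggregated loss.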

Key Capabilities

  • Mathematical Reasoning: Optimized for solving grade school math problems, as evidenced by its training on the GSM8K dataset.
  • Structured Output: Designed to produce responses with a distinct thought process (<think>...</think>) and a final answer (<answer>...</answer>), facilitating clear and verifiable solutions.
  • Reinforcement Learning: Leverages GRPO for enhanced performance in generating correct and well-reasoned mathematical solutions.
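The structured output format lends itself to simple parsing. The following is a minimal sketch of how a response could be split into its reasoning and answer parts; `parse_response` is a hypothetical helper, and the exact whitespace between the tags, as well as whether the prompt template already supplies the opening `<think>` tag, are assumptions not confirmed by the model card.

```python
import re
from typing import Dict, Optional

# Matches a reasoning block followed by an answer block; DOTALL lets
# either block span multiple lines.
_RESPONSE_PATTERN = re.compile(
    r"<think>(.*?)</think>\s*<answer>(.*?)</answer>", re.DOTALL
)


def parse_response(text: str) -> Optional[Dict[str, str]]:
    """Return the reasoning and answer, or None if the format is not followed."""
    match = _RESPONSE_PATTERN.search(text)
    if match is None:
        return None
    return {"think": match.group(1).strip(), "answer": match.group(2).strip()}


if __name__ == "__main__":
    sample = "<think>3 bags of 4 apples is 3 * 4 = 12.</think> <answer>12</answer>"
    print(parse_response(sample))
```

A response that returns `None` here would count as a format failure under this sketch; the model's actual format check may be stricter or looser.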

Performance

On the GSM8K test set, the model demonstrates:

  • Correct Format: 1172 out of 1319 responses (about 88.9%) adhered to the specified <think> and <answer> tag format.
  • Correct Reward: 966 out of 1319 responses (about 73.2%) received the correctness reward, indicating successful problem-solving.

Training Details

The GRPO fine-tuning used a learning rate of 3e-5, 100 GRPO steps, and a rollout batch size of 256, with the AdamW optimizer and a linear learning rate scheduler. The model's prompt template instructs it to first think through the reasoning process and then provide the answer in the structured format.
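To make the schedule concrete, here is a minimal sketch of a linear learning rate schedule over the 100 GRPO steps. The model card only says the scheduler is linear, so the choices here (no warmup, decay from 3e-5 to zero at the final step) are assumptions, and `linear_lr` is a hypothetical helper rather than part of any training library.

```python
def linear_lr(step: int, total_steps: int = 100, base_lr: float = 3e-5) -> float:
    """Learning rate at a given step, decaying linearly from base_lr to 0.

    Assumes no warmup phase and a final learning rate of zero.
    """
    if not 0 <= step <= total_steps:
        raise ValueError("step must be between 0 and total_steps")
    return base_lr * (1.0 - step / total_steps)


if __name__ == "__main__":
    for step in (0, 25, 50, 75, 100):
        print(step, linear_lr(step))
```

In a PyTorch training loop, the same shape could be obtained by pairing `torch.optim.AdamW` with a scheduler such as `torch.optim.lr_scheduler.LambdaLR`.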