Name: michaelbzhu/Qwen2.5-Math-1.5B-GSM8K-GRPO API
Brand: Featherless.ai
Price: 10.00 USD
Availability: InStock
Author: michaelbzhu

Model Overview

michaelbzhu/Qwen2.5-Math-1.5B-GSM8K-GRPO is a 1.5-billion-parameter model built on the Qwen2.5-Math-1.5B base. It was fine-tuned with Group Relative Policy Optimization (GRPO) on the GSM8K dataset to target mathematical reasoning tasks. The fine-tuning uses a REINFORCE loss with a baseline and aggregates the loss with per-sample length normalization.
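To make the loss description concrete, here is a minimal sketch of a REINFORCE loss with a group-mean baseline and per-sample length normalization. It is not the author's training code: the function names, the toy rewards, and the toy log-probabilities are all hypothetical, and the sketch assumes the baseline is the plain mean reward of the sampled group. The model card does not say whether advantages are also divided by the group's standard deviation, so that step is left out.

```python
from statistics import mean


def group_baseline_advantages(rewards):
    """Subtract the group-mean reward (the baseline) from each rollout's reward."""
    baseline = mean(rewards)
    return [r - baseline for r in rewards]


def length_normalized_reinforce_loss(token_logprobs, advantages):
    """REINFORCE loss with each sample's token terms averaged over its own length.

    token_logprobs: one list of per-token log-probabilities per sampled response.
    advantages: one scalar advantage per sampled response.
    """
    per_sample = []
    for logps, adv in zip(token_logprobs, advantages):
        # -advantage * log-probability, averaged over this response's tokens
        per_sample.append(-adv * sum(logps) / len(logps))
    # Average the per-sample losses across the group
    return mean(per_sample)


# Toy group of four rollouts for one prompt (values are illustrative only)
rewards = [1.0, 0.0, 0.0, 1.0]
token_logprobs = [
    [-0.5, -0.5],
    [-1.0, -1.0, -1.0, -1.0],
    [-2.0],
    [-0.25, -0.75],
]

advantages = group_baseline_advantages(rewards)
loss = length_normalized_reinforce_loss(token_logprobs, advantages)
print(advantages, loss)
```

Because each response is divided by its own token count before averaging, a long response contributes no more to the loss than a short one with the same average log-probability.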

Key Capabilities

Mathematical Reasoning: Optimized for solving grade school math problems through training on the GSM8K dataset.
Structured Output: Designed to produce responses with a distinct thought process (<think>...</think>) and a final answer (<answer>...</answer>), facilitating clear and verifiable solutions.
Reinforcement Learning: Uses GRPO fine-tuning to improve the correctness and reasoning quality of its mathematical solutions.
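The structured output described above can be consumed with a small parser. The following is a sketch only: the regular expression, the `parse_response` helper, and the sample string are assumptions made for illustration, and the exact format check used when evaluating the model is not stated in the model card.

```python
import re

# Assumed layout: a <think> block followed by an <answer> block
_PATTERN = re.compile(
    r"<think>(.*?)</think>\s*<answer>(.*?)</answer>", re.DOTALL
)


def parse_response(text):
    """Return (reasoning, answer) if the text matches the tag format, else None."""
    match = _PATTERN.search(text)
    if match is None:
        return None
    return match.group(1).strip(), match.group(2).strip()


# Hypothetical model output, for illustration only
sample = "<think>3 apples + 4 apples = 7 apples.</think> <answer>7</answer>"
parsed = parse_response(sample)
print(parsed)
```

Returning `None` for unformatted text lets a caller count format failures separately from wrong answers.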

Performance

On the GSM8K test set, the model demonstrates:

Correct Format: 1172 out of 1319 responses (88.9%) adhered to the specified <think> and <answer> tag format.
Correct Reward: 966 out of 1319 responses (73.2%) received a correct reward, meaning the problem was solved successfully.

Training Details

The GRPO fine-tuning used a learning rate of 3e-5, 100 GRPO steps, and a rollout batch size of 256, with a linear learning rate scheduler and the AdamW optimizer. The model's prompt template instructs it to think through the reasoning process first and then provide the answer in the structured format.
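As an illustration of the schedule, the sketch below computes a linear learning rate over the 100 GRPO steps, starting from 3e-5. The model card only says the scheduler is linear, so two details here are assumptions: there is no warmup, and the rate decays to zero at the final step. The `linear_lr` helper is hypothetical.

```python
BASE_LR = 3e-5      # learning rate stated in the model card
TOTAL_STEPS = 100   # number of GRPO steps stated in the model card


def linear_lr(step, base_lr=BASE_LR, total_steps=TOTAL_STEPS):
    """Learning rate at `step`, decaying linearly from base_lr to 0.

    Assumes no warmup and a final value of zero; neither is confirmed
    by the model card.
    """
    remaining = max(0.0, 1.0 - step / total_steps)
    return base_lr * remaining


schedule = [linear_lr(step) for step in range(TOTAL_STEPS)]
print(schedule[0], schedule[50], schedule[-1])
```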
