jaygala24/Qwen2.5-3B-GRPO-math-reasoning

Text Generation | Concurrency Cost: 1 | Model Size: 3.1B | Quant: BF16 | Ctx Length: 32k | Published: Apr 6, 2026 | License: apache-2.0 | Architecture: Transformer | Open Weights | Cold

The jaygala24/Qwen2.5-3B-GRPO-math-reasoning model is a 3-billion-parameter language model based on Qwen2.5 and fine-tuned by jaygala24. It was trained with Group Relative Policy Optimization (GRPO) without a KL penalty to improve mathematical reasoning. It reaches 84.45% pass@1 on GSM8K (test) and 64.48% pass@1 on MATH-500, making it suitable for applications that require step-by-step mathematical problem-solving.

Model Overview

This model, jaygala24/Qwen2.5-3B-GRPO-math-reasoning, is a specialized fine-tune of the Qwen2.5-3B base model. Its main distinction is the training method: Group Relative Policy Optimization (GRPO), a reinforcement learning technique, applied here without a KL penalty to improve mathematical reasoning.
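To make "GRPO without a KL penalty" concrete, here is a minimal sketch of the commonly described form of the algorithm: each sampled completion's reward is normalized against the other completions for the same prompt, and the clipped policy-gradient term is used with the KL coefficient set to 0.0. The function names, the `clip_eps` value, and the use of standard-deviation normalization are illustrative assumptions; the card does not describe how PipelineRL implements these details.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of completions for the same prompt.

    Assumption: mean/std normalization, as in the commonly described GRPO
    formulation. The actual training code may differ.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def grpo_objective_term(ratio, advantage, clip_eps=0.2, kl_coef=0.0, kl=0.0):
    """Per-sample clipped surrogate minus an optional KL penalty.

    ratio     : new-policy / old-policy probability ratio for the sample
    clip_eps  : hypothetical clipping range, not taken from the model card
    kl_coef   : 0.0 matches the "no KL penalty" setting stated in the card
    """
    clipped_ratio = max(min(ratio, 1.0 + clip_eps), 1.0 - clip_eps)
    surrogate = min(ratio * advantage, clipped_ratio * advantage)
    return surrogate - kl_coef * kl


# Four completions for one prompt, rewarded 1.0 if the final answer is correct.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

With `kl_coef=0.0` the `kl` argument has no effect on the objective, so the policy is not explicitly pulled back toward a reference model.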

Key Capabilities & Training

  • Mathematical Reasoning: Specifically optimized for solving mathematical problems, as shown by the benchmark results listed under Evaluation.
  • GRPO Fine-tuning: Uses GRPO with the KL coefficient set to 0.0 for policy optimization, trained with PipelineRL.
  • Dataset Focus: Trained on a combination of the gsm8k_train and math_train datasets, covering both grade-school word problems and competition-style mathematics.
  • Evaluation: Reports the following pass@1 scores on mathematical reasoning benchmarks:
    • GSM8K (test): 84.45% pass@1
    • MATH-500: 64.48% pass@1
    • Overall: 78.96% pass@1 across 1819 problems.
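The overall figure is consistent with a problem-weighted average of the two benchmark scores. The short check below assumes the standard split sizes (1319 problems in the GSM8K test set and 500 in MATH-500), which are not stated in the card but sum to the reported 1819; the function name is illustrative.

```python
# Assumed split sizes (standard for these benchmarks, not stated in the card).
GSM8K_TEST_SIZE = 1319
MATH500_SIZE = 500


def combined_pass_at_1(scores_and_sizes):
    """Return the problem-weighted mean score and the total problem count."""
    total = sum(size for _, size in scores_and_sizes)
    weighted_sum = sum(score * size for score, size in scores_and_sizes)
    return weighted_sum / total, total


overall, n_problems = combined_pass_at_1(
    [(84.45, GSM8K_TEST_SIZE), (64.48, MATH500_SIZE)]
)
print(f"{overall:.2f}% pass@1 over {n_problems} problems")
```

Under these assumptions the weighted mean comes to 78.96% over 1819 problems, matching the reported overall score.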

When to Use This Model

This model is particularly well-suited for applications requiring:

  • Accurate Mathematical Problem Solving: Ideal for tasks that demand step-by-step reasoning to arrive at a numerical or logical mathematical answer.
  • Educational Tools: Can be integrated into systems for generating solutions or explanations for math problems.
  • Research in RL for Reasoning: Provides a strong baseline for exploring the impact of GRPO on complex reasoning tasks.

Its specialized training makes it a reasonable choice for focused mathematical applications at the 3B-parameter scale.
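For the use cases above, a minimal inference sketch with the Hugging Face `transformers` library might look like the following. It assumes the weights are available on the Hugging Face Hub under the same model ID and that `torch` and `transformers` are installed. The prompt wording in `build_prompt` is hypothetical, since the card does not specify a prompt format, and `solve` is an illustrative helper rather than part of any published API.

```python
MODEL_ID = "jaygala24/Qwen2.5-3B-GRPO-math-reasoning"


def build_prompt(question: str) -> str:
    """Wrap a question in a step-by-step instruction (hypothetical wording)."""
    return (
        "Solve the following math problem step by step.\n\n"
        f"Problem: {question}\n\nSolution:"
    )


def solve(question: str, max_new_tokens: int = 512) -> str:
    """Generate a solution with greedy decoding.

    Imports are done lazily so the prompt helper can be used without
    torch/transformers installed.
    """
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    # BF16 matches the quantization listed in the card's metadata.
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype=torch.bfloat16)

    inputs = tokenizer(build_prompt(question), return_tensors="pt")
    with torch.no_grad():
        output_ids = model.generate(
            **inputs, max_new_tokens=max_new_tokens, do_sample=False
        )

    # Keep only the newly generated tokens, dropping the echoed prompt.
    prompt_length = inputs["input_ids"].shape[1]
    return tokenizer.decode(output_ids[0][prompt_length:], skip_special_tokens=True)
```

Calling `solve("A train travels 60 km in 1.5 hours. What is its average speed?")` downloads the weights on first use and returns the generated solution text. Greedy decoding is used here for reproducibility; the sampling settings behind the reported pass@1 scores are not given in the card.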