jaygala24/Qwen2.5-1.5B-GRPO-math-reasoning

Text generation · Model size: 1.5B · Quantization: BF16 · Context length: 32k · Published: Apr 13, 2026 · License: apache-2.0 · Architecture: Transformer · Open weights

The jaygala24/Qwen2.5-1.5B-GRPO-math-reasoning model is a 1.5-billion-parameter Qwen2.5-based language model, fine-tuned with Group Relative Policy Optimization (GRPO) without a KL penalty. It is optimized for mathematical reasoning and shows strong performance on benchmarks such as GSM8K and MATH-500. The model excels at generating step-by-step solutions to math problems, making it suitable for applications requiring precise numerical and logical deduction.


Overview

This model, jaygala24/Qwen2.5-1.5B-GRPO-math-reasoning, is a specialized fine-tune of the Qwen2.5-1.5B base model. Its primary distinction is its training methodology: Group Relative Policy Optimization (GRPO) without a KL penalty, a reinforcement learning technique from PipelineRL, applied to enhance mathematical reasoning capabilities.
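The group-relative baseline at the heart of GRPO can be sketched in a few lines: for each prompt, sample a group of completions, score each one (e.g. 1.0 if the final answer matches the reference, else 0.0), and set each completion's advantage to its reward minus the group mean, normalized by the group standard deviation. Dropping the KL penalty means there is no reference-model term added to the loss. The function below is an illustrative sketch of the advantage computation, not the PipelineRL implementation:

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-6):
    """GRPO advantages for one group of sampled completions.

    Each completion's advantage is its reward minus the group mean,
    normalized by the group standard deviation (eps avoids division
    by zero when all rewards in the group are identical).
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Example: 4 completions for one math prompt, binary correctness reward.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Because the baseline is computed within each sampled group, correct completions are pushed up and incorrect ones pushed down relative to their peers, with no learned value network required.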

Key Capabilities & Performance

  • Mathematical Reasoning: Specifically optimized for solving mathematical problems, as evidenced by its training on gsm8k_train and math_train datasets.
  • Strong Benchmark Results: Achieves notable pass@k scores on challenging math benchmarks:
    • GSM8K (test): 75.18% pass@1, 96.89% pass@32
    • MATH-500: 54.73% pass@1, 87.00% pass@32
    • Overall: 69.56% pass@1, 94.17% pass@32 across 1819 problems.
  • Efficient Training: Leverages advanced techniques like DeepSpeed ZeRO Stage 3 and bf16 precision during its 1500-step training process.
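The model card does not state how the pass@k figures above were computed; such numbers are typically obtained with the standard unbiased estimator from the Codex paper, pass@k = 1 − C(n−c, k)/C(n, k) for n samples per problem with c correct, averaged over problems. A sketch under that assumption:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem.

    n: total generations sampled, c: how many were correct,
    k: budget of attempts. Returns the probability that at least
    one of k samples drawn without replacement is correct.
    """
    if n - c < k:
        return 1.0  # fewer than k incorrect samples exist, so a hit is guaranteed
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 32 samples for a problem, 24 correct -> pass@1 = 24/32 = 0.75
score = pass_at_k(32, 24, 1)
```

The benchmark-level number is the mean of this per-problem estimate over all evaluated problems (e.g. the 1819 problems in the overall figure).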

When to Use This Model

  • Mathematical Problem Solving: Ideal for applications requiring accurate, step-by-step solutions to arithmetic and advanced mathematical problems.
  • Educational Tools: Can be integrated into platforms for tutoring or generating explanations for math concepts.
  • Research in RL for Reasoning: Provides a strong baseline for further experimentation with GRPO and similar reinforcement learning approaches in language models.
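For the problem-solving use cases above, a minimal inference sketch using the Hugging Face transformers chat-template API follows; the prompt wording and generation settings are illustrative assumptions, not part of the model card:

```python
MODEL_ID = "jaygala24/Qwen2.5-1.5B-GRPO-math-reasoning"

def build_messages(question: str) -> list:
    """Wrap a math question in a single chat turn asking for worked steps."""
    return [{
        "role": "user",
        "content": f"Solve the problem step by step, then state the final answer.\n\n{question}",
    }]

def solve(question: str, max_new_tokens: int = 512) -> str:
    """Generate a step-by-step solution. Requires `transformers` and `torch`."""
    # Imported here so the prompt helper above stays usable without torch installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="bfloat16")
    prompt = tokenizer.apply_chat_template(
        build_messages(question), tokenize=False, add_generation_prompt=True
    )
    inputs = tokenizer(prompt, return_tensors="pt")
    output = model.generate(**inputs, max_new_tokens=max_new_tokens)
    # Decode only the newly generated tokens, not the echoed prompt.
    return tokenizer.decode(
        output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
    )
```

Loading in bf16 matches the listed quantization; for batch evaluation or higher throughput, a serving engine such as vLLM would be a natural substitute for the `generate` call.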