jaygala24/Qwen2.5-3B-GRPO-KL-math-reasoning

Text Generation | Concurrency Cost: 1 | Model Size: 3.1B | Quant: BF16 | Ctx Length: 32k | Published: Apr 6, 2026 | License: apache-2.0 | Architecture: Transformer | Open Weights | Cold

jaygala24/Qwen2.5-3B-GRPO-KL-math-reasoning is a 3 billion parameter language model fine-tuned from Qwen2.5-3B. It was trained with Group Relative Policy Optimization (GRPO) with a KL penalty to improve mathematical reasoning. The model reports 85.60% pass@1 on GSM8K (test) and 65.11% pass@1 on MATH-500, and is intended for applications that need accurate, step-by-step solutions to math problems.


Overview

This model, jaygala24/Qwen2.5-3B-GRPO-KL-math-reasoning, is a fine-tune of the Qwen2.5-3B base model that targets mathematical reasoning. What sets it apart is the training method: Group Relative Policy Optimization (GRPO) with a KL penalty. Training used the PipelineRL framework, with datasets including gsm8k_train and math_train.
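The training method named above can be sketched in a few lines. This is a minimal illustration of the general GRPO recipe, not the PipelineRL implementation: rewards for a group of completions sampled from the same prompt are normalized against the group's own mean and standard deviation, and a KL term penalizes drift from a reference policy. The function names, the clipping range `clip_eps`, and the penalty weight `beta` are assumptions chosen for illustration; the values actually used to train this model are not stated here.

```python
import math
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against its own sampling group (no value network)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def kl_penalty_k3(logp_policy, logp_ref):
    """Per-token KL estimate: exp(d) - d - 1 with d = logp_ref - logp_policy.

    This estimator is non-negative and is zero when both policies agree.
    """
    log_ratio = logp_ref - logp_policy
    return math.exp(log_ratio) - log_ratio - 1.0


def grpo_token_objective(ratio, advantage, logp_policy, logp_ref,
                         clip_eps=0.2, beta=0.04):
    """Clipped surrogate minus a weighted KL penalty (to be maximized).

    clip_eps and beta are illustrative placeholders, not this model's settings.
    """
    clipped_ratio = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
    surrogate = min(ratio * advantage, clipped_ratio * advantage)
    return surrogate - beta * kl_penalty_k3(logp_policy, logp_ref)


# Four completions for one prompt, scored 1.0 if the final answer is correct.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
print(advantages)
```

Correct completions receive positive advantages and incorrect ones negative, so the policy is pushed toward answers that beat the group average while the KL term keeps it close to the reference model.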

Key Capabilities & Performance

  • Enhanced Mathematical Reasoning: Specifically optimized for solving arithmetic and complex mathematical problems.
  • GRPO with KL Penalty: Fine-tuned with a reinforcement learning algorithm that optimizes the policy under a KL divergence penalty.
  • Strong Benchmark Results: Achieves an overall pass@1 score of 79.96% across GSM8K and MATH-500 datasets, with pass@32 reaching 96.15%.
    • GSM8K (test): 85.60% pass@1, 97.95% pass@32.
    • MATH-500: 65.11% pass@1, 91.40% pass@32.
  • Sequence Length: Supports a sequence length of 8192 tokens, suitable for detailed problem-solving steps.
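The overall figures in the list above appear consistent with a problem-count-weighted average of the two benchmarks. The sketch below checks that arithmetic, assuming 1,319 problems in the GSM8K test split and 500 in MATH-500 (the benchmarks' standard sizes, not stated on this page). It also includes the commonly used unbiased pass@k estimator; whether the reported scores were computed with exactly this estimator is an assumption.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased pass@k estimate from n samples of which c are correct."""
    # comb(n - c, k) is 0 when fewer than k incorrect samples exist.
    return 1.0 - comb(n - c, k) / comb(n, k)


def weighted_overall(scores, sizes):
    """Average per-benchmark scores weighted by the number of problems."""
    total = sum(sizes.values())
    return sum(scores[name] * sizes[name] for name in scores) / total


# Assumed problem counts (standard sizes, not taken from this page).
SIZES = {"gsm8k_test": 1319, "math_500": 500}

# Scores as reported in the list above, in percent.
PASS_AT_1 = {"gsm8k_test": 85.60, "math_500": 65.11}
PASS_AT_32 = {"gsm8k_test": 97.95, "math_500": 91.40}

overall_1 = weighted_overall(PASS_AT_1, SIZES)
overall_32 = weighted_overall(PASS_AT_32, SIZES)
print(f"overall pass@1:  {overall_1:.2f}%")
print(f"overall pass@32: {overall_32:.2f}%")
```

Under these assumptions both results land within about 0.01 percentage points of the reported 79.96% and 96.15%; the small gap is what rounding in the per-benchmark scores would produce.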

When to Use This Model

  • Mathematical Problem Solving: Ideal for applications requiring accurate and step-by-step solutions to math problems.
  • Educational Tools: Can be integrated into platforms for tutoring or generating math exercises.
  • Research in RL for LLMs: Serves as a practical example of applying GRPO to improve performance on a specific task.
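For the math problem-solving use case, a minimal loading sketch with the Hugging Face transformers library might look like the following. It assumes the weights can be loaded under the model identifier shown on this page and that a plain instruction-style prompt is acceptable; the prompt wording and the `build_prompt` and `solve` helpers are hypothetical, since the prompt format used during training is not documented here.

```python
MODEL_ID = "jaygala24/Qwen2.5-3B-GRPO-KL-math-reasoning"


def build_prompt(question: str) -> str:
    """Hypothetical prompt wording; the training prompt format is not stated."""
    return (
        "Solve the following math problem step by step.\n\n"
        f"Problem: {question}\n\nSolution:"
    )


def solve(question: str, max_new_tokens: int = 512) -> str:
    """Generate a greedy step-by-step solution for one question."""
    # Imported lazily so the prompt helper works without the heavy dependencies.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    # BF16 matches the quantization listed in the page metadata.
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype=torch.bfloat16)

    inputs = tokenizer(build_prompt(question), return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)

    # Drop the prompt tokens and decode only the newly generated text.
    new_tokens = output_ids[0, inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)
```

Calling `solve("What is 12 * 7?")` downloads the weights on first use. Sampling several completions per problem instead of greedy decoding is one way to approximate the pass@32 setting reported above.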