jaygala24/Qwen2.5-3B-GRPO-KL-math-reasoning
Text Generation · Concurrency Cost: 1 · Model Size: 3.1B · Quant: BF16 · Ctx Length: 32k · Published: Apr 6, 2026 · License: apache-2.0 · Architecture: Transformer · Open Weights

jaygala24/Qwen2.5-3B-GRPO-KL-math-reasoning is a 3.1 billion parameter fine-tune of Qwen2.5-3B by jaygala24, trained with Group Relative Policy Optimization (GRPO) plus a KL penalty. The model targets mathematical reasoning, drawing on the GSM8K and MATH datasets, and its 32,768-token context length supports long, step-by-step mathematical solutions.


Model Overview

This model, jaygala24/Qwen2.5-3B-GRPO-KL-math-reasoning, is a specialized fine-tune of the 3.1 billion parameter Qwen2.5-3B base model. Developed by jaygala24, it differs from the base model chiefly in its training methodology: Group Relative Policy Optimization (GRPO) with a KL penalty. This reinforcement learning approach, implemented via the PipelineRL framework, is designed to enhance mathematical reasoning capabilities.
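The card does not include an inference snippet, so here is a minimal sketch of how a math question could be formatted for the model. It assumes the fine-tune inherits the ChatML-style chat template of the Qwen2.5 base family; the special tokens and system instruction below are assumptions, and in practice `tokenizer.apply_chat_template` from the model's own tokenizer should be preferred:

```python
# Sketch: build a ChatML-style prompt for a math question.
# Assumption: the fine-tune keeps Qwen2.5's <|im_start|>/<|im_end|> template;
# the system instruction is illustrative, not taken from this model card.

def build_prompt(question: str) -> str:
    system = "Please reason step by step, and put your final answer within \\boxed{}."
    return (
        f"<|im_start|>system\n{system}<|im_end|>\n"
        f"<|im_start|>user\n{question}<|im_end|>\n"
        f"<|im_start|>assistant\n"
    )

print(build_prompt("What is 12 * 7?"))
```

The resulting string can be tokenized and passed to any standard text-generation pipeline.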

Key Capabilities & Training

  • Mathematical Reasoning: Optimized for complex mathematical problems, as evidenced by its training on gsm8k_train and math_train datasets.
  • GRPO with KL Penalty: Uses GRPO with a KL coefficient of 0.001 and a clipping epsilon of 0.02 in the policy loss.
  • Robust Training: Trained for 1,500 steps with an effective batch size of 256 and a sequence length of 8,192, using bf16 precision and DeepSpeed ZeRO Stage 3 for efficiency.
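The GRPO-with-KL-penalty objective listed above can be sketched numerically. This is an illustrative reimplementation, not the PipelineRL code: advantages are rewards normalized within each group of sampled completions, the policy ratio is clipped at epsilon = 0.02, and a KL penalty weighted by 0.001 is subtracted. The per-sequence log-probability inputs and the k3 KL estimator are assumptions about the implementation details:

```python
import math

KL_COEF = 0.001   # KL coefficient from the training config
EPSILON = 0.02    # clipping epsilon for the policy loss

def group_advantages(rewards):
    """Group-relative advantages: normalize rewards within one group."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var) + 1e-8
    return [(r - mean) / std for r in rewards]

def grpo_loss(logp_new, logp_old, logp_ref, rewards):
    """Per-group GRPO loss: clipped surrogate minus KL penalty (sketch)."""
    advs = group_advantages(rewards)
    total = 0.0
    for lp_new, lp_old, lp_ref, adv in zip(logp_new, logp_old, logp_ref, advs):
        ratio = math.exp(lp_new - lp_old)
        clipped = max(min(ratio, 1 + EPSILON), 1 - EPSILON)
        surrogate = min(ratio * adv, clipped * adv)
        # k3 KL estimator against the reference (frozen) policy
        kl = math.exp(lp_ref - lp_new) - (lp_ref - lp_new) - 1
        total += surrogate - KL_COEF * kl
    return -total / len(rewards)  # negated: minimizing loss maximizes the objective
```

Because advantages are centered within each group, a group whose completions all receive the same reward contributes zero learning signal, which is the defining property of the group-relative formulation.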

Ideal Use Cases

  • Solving Math Problems: Particularly effective for tasks requiring step-by-step mathematical reasoning and final answer extraction.
  • Educational Tools: Can be integrated into applications that assist with mathematical problem-solving or provide detailed explanations.
  • Research in RL for Reasoning: Serves as a practical example of GRPO application for improving LLM performance on specific cognitive tasks.
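For the "final answer extraction" use case above, a small post-processing helper can be sketched. The `\boxed{}` convention (common in MATH-style solutions) and the GSM8K-style `#### ` marker are assumptions about the output format, not something this card specifies:

```python
import re

def extract_final_answer(text):
    """Pull the final answer from a model completion (illustrative only)."""
    # \boxed{...} convention common in MATH-style solutions
    m = re.search(r"\\boxed\{([^{}]*)\}", text)
    if m:
        return m.group(1).strip()
    # GSM8K-style "#### answer" convention
    m = re.search(r"####\s*(.+)", text)
    if m:
        return m.group(1).strip()
    return None

print(extract_final_answer("... so the answer is \\boxed{42}."))  # -> 42
```

A helper like this makes it easy to score completions automatically, e.g. for exact-match evaluation against GSM8K reference answers.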