jaygala24/Qwen2.5-3B-GRPO-KL-math-reasoning is a 3.1-billion-parameter Qwen2.5-3B model fine-tuned by jaygala24 using Group Relative Policy Optimization (GRPO) with a KL penalty. The model is optimized for mathematical reasoning, trained on datasets such as GSM8K and MATH. With a context length of 32768 tokens, it is well suited to generating step-by-step mathematical solutions.
Model Overview
This model, jaygala24/Qwen2.5-3B-GRPO-KL-math-reasoning, is a specialized fine-tune of the 3.1-billion-parameter Qwen2.5-3B base model. Developed by jaygala24, its core differentiator is its training methodology: Group Relative Policy Optimization (GRPO) with a KL penalty. This reinforcement learning approach, implemented via the PipelineRL framework, is designed to enhance mathematical reasoning capabilities.
Key Capabilities & Training
- Mathematical Reasoning: Optimized for complex mathematical problems, as evidenced by its training on the `gsm8k_train` and `math_train` datasets.
- GRPO with KL Penalty: Uses a KL coefficient of `0.001` and a clip epsilon of `0.02` for the policy loss.
- Robust Training: Trained for 1500 steps with an effective batch size of 256 and a sequence length of 8192, using `bf16` precision and DeepSpeed ZeRO Stage 3 for efficiency.
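To make the training parameters above concrete, here is a minimal sketch of a GRPO-style loss with a KL penalty. It is a simplified, sequence-level illustration, not the PipelineRL implementation: real GRPO training operates token-wise over sampled completions, and the choice of the k3 KL estimator here is an assumption. The `kl_coef=0.001` and `clip_eps=0.02` defaults mirror the values listed above.

```python
import numpy as np

def grpo_loss(logp_new, logp_old, logp_ref, rewards,
              kl_coef=0.001, clip_eps=0.02):
    """Sketch of a GRPO policy loss with a KL penalty.

    logp_new / logp_old / logp_ref: log-probs of each sampled completion
    under the current, sampling, and reference policies (shape [G]).
    rewards: scalar reward per completion in the group (shape [G]).
    """
    # Group-relative advantage: normalize rewards within the group,
    # so no separate value network is needed.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
    # Importance ratio between current and sampling policy.
    ratio = np.exp(logp_new - logp_old)
    # PPO-style clipped surrogate objective (epsilon = clip_eps).
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    surrogate = np.minimum(ratio * adv, clipped * adv)
    # k3 estimator of KL(new || ref), penalizing drift from the reference.
    kl = np.exp(logp_ref - logp_new) - (logp_ref - logp_new) - 1.0
    # Minimize negative surrogate plus the weighted KL penalty.
    return -(surrogate - kl_coef * kl).mean()
```

With identical current, old, and reference log-probs the ratio is 1 and the KL term vanishes, so the loss reduces to the negative mean advantage (zero by construction), which is a useful sanity check when wiring this into a trainer.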
Ideal Use Cases
- Solving Math Problems: Particularly effective for tasks requiring step-by-step mathematical reasoning and final answer extraction.
- Educational Tools: Can be integrated into applications that assist with mathematical problem-solving or provide detailed explanations.
- Research in RL for Reasoning: Serves as a practical example of GRPO application for improving LLM performance on specific cognitive tasks.
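The "final answer extraction" mentioned above can be sketched as a small post-processing step. This assumes the model emits GSM8K-style `#### <answer>` markers or LaTeX `\boxed{...}` answers; the model's actual output format may differ, so treat the patterns as illustrative.

```python
import re
from typing import Optional

def extract_final_answer(completion: str) -> Optional[str]:
    """Pull the final answer out of a step-by-step solution.

    Tries the GSM8K-style '#### <answer>' marker first, then a LaTeX
    '\\boxed{<answer>}' expression. Returns None if neither is found.
    """
    m = re.search(r"####\s*([^\n]+)", completion)
    if m:
        return m.group(1).strip()
    m = re.search(r"\\boxed\{([^{}]+)\}", completion)
    if m:
        return m.group(1).strip()
    return None
```

In an educational or evaluation pipeline, the extracted string can then be normalized (e.g. stripping commas and units) before comparing against a reference answer.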