jaygala24/Qwen2.5-0.5B-GRPO-KL-math-reasoning
Text Generation · Concurrency cost: 1 · Model size: 0.5B · Quant: BF16 · Context length: 32k · Published: Apr 13, 2026 · License: apache-2.0 · Architecture: Transformer · Open Weights · Cold
jaygala24/Qwen2.5-0.5B-GRPO-KL-math-reasoning is a 0.5 billion parameter Qwen2.5-based language model fine-tuned by jaygala24. It utilizes Group Relative Policy Optimization (GRPO) with a KL penalty to enhance mathematical reasoning capabilities. This model is specifically optimized for solving mathematical problems, as evidenced by its performance on GSM8K and MATH-500 datasets, and has a context length of 32768 tokens.
Overview
This model, jaygala24/Qwen2.5-0.5B-GRPO-KL-math-reasoning, is a specialized fine-tune of the Qwen2.5-0.5B base model. Its primary focus is mathematical reasoning, achieved through reinforcement-learning fine-tuning with GRPO and a KL penalty.
Key Capabilities & Training
- Mathematical Reasoning: The model is specifically fine-tuned for mathematical problem-solving.
- GRPO with KL Penalty: It leverages Group Relative Policy Optimization (GRPO), an advanced reinforcement learning algorithm, combined with a KL penalty for training. This method uses the group mean reward as a baseline for relative advantages.
- Targeted Datasets: Training used the `gsm8k_train` and `math_train` datasets, with evaluation on `gsm8k_test` and `math_500`.
- Performance: Achieves an overall pass@1 of 43.62% and pass@32 of 83.01% across the GSM8K and MATH-500 benchmarks, showing that the model frequently produces a correct mathematical solution when given multiple attempts.
- Efficient Training: Trained in `bf16` precision with DeepSpeed ZeRO Stage 3 for optimized resource utilization.
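The group-relative baseline described above can be sketched in a few lines. This is an illustrative reconstruction, not the author's training code: the function names are hypothetical, and the `k3` KL estimator shown is the one commonly paired with GRPO-style objectives, which this card does not confirm was used here.

```python
import math
import statistics


def grpo_advantages(rewards, eps=1e-8):
    """Group-relative advantages: each sampled completion's reward is
    baselined against the mean reward of its group of rollouts and
    normalized by the group standard deviation."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


def kl_penalty(logp_policy, logp_ref):
    """Per-token KL penalty via the k3 estimator (a common choice in
    GRPO implementations; assumed here): exp(d) - d - 1 with
    d = logp_ref - logp_policy. Non-negative, zero when the policy
    matches the reference."""
    d = logp_ref - logp_policy
    return math.exp(d) - d - 1.0


# Example: four rollouts for one prompt, rewards 1 = correct, 0 = wrong.
advs = grpo_advantages([1.0, 0.0, 1.0, 0.0])
```

Correct completions in a group receive positive advantages and incorrect ones negative, with no learned value network needed; the KL term (scaled by a coefficient) keeps the policy close to the reference model.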
Good For
- Applications requiring robust mathematical problem-solving.
- Research into reinforcement learning techniques for language models, particularly GRPO.
- Developing agents that need to reason step-by-step through mathematical challenges.
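The pass@1 and pass@32 figures quoted above are typically computed with the standard unbiased estimator used across code- and math-generation benchmarks (whether this card used exactly this estimator is an assumption). Given `n` samples per problem of which `c` are correct, it estimates the probability that at least one of `k` drawn samples is correct:

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: 1 - C(n - c, k) / C(n, k).
    n = total samples per problem, c = correct samples among them,
    k = samples the metric grants per problem."""
    if n - c < k:
        # Too few incorrect samples to fill k draws: some draw must
        # contain a correct sample.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# Example: 1 correct out of 2 samples gives pass@1 = 0.5, pass@2 = 1.0.
p1 = pass_at_k(2, 1, 1)
p2 = pass_at_k(2, 1, 2)
```

Averaging `pass_at_k` over all benchmark problems yields the reported aggregate score; pass@32 being roughly double pass@1 here reflects how often a correct solution appears somewhere in a larger sample budget.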