jaygala24/Qwen2.5-0.5B-GRPO-KL-math-reasoning

Text generation · Model size: 0.5B · Quant: BF16 · Context length: 32k · Published: Apr 13, 2026 · License: apache-2.0 · Architecture: Transformer · Open weights

jaygala24/Qwen2.5-0.5B-GRPO-KL-math-reasoning is a 0.5-billion-parameter language model based on Qwen2.5, fine-tuned by jaygala24 using Group Relative Policy Optimization (GRPO) with a KL penalty to strengthen mathematical reasoning. It is optimized for solving mathematical problems, as evidenced by its results on the GSM8K and MATH-500 benchmarks, and supports a context length of 32,768 tokens.


Overview

This model, jaygala24/Qwen2.5-0.5B-GRPO-KL-math-reasoning, is a specialized fine-tuned version of the Qwen2.5-0.5B base model. Its primary focus is mathematical reasoning, achieved through reinforcement learning fine-tuning with GRPO.

Key Capabilities & Training

  • Mathematical Reasoning: The model is specifically fine-tuned for mathematical problem-solving.
  • GRPO with KL Penalty: It is trained with Group Relative Policy Optimization (GRPO), a reinforcement learning algorithm that samples a group of completions per prompt and uses the group mean reward as the baseline for computing relative advantages, combined with a KL penalty that keeps the policy close to the reference model.
  • Targeted Datasets: Training involved gsm8k_train and math_train datasets, with evaluation on gsm8k_test and math_500.
  • Performance: Achieves an overall pass@1 score of 43.62% and pass@32 of 83.01% across the GSM8K and MATH-500 benchmarks; the gap between the two indicates the model frequently produces a correct solution when given multiple attempts.
  • Efficient Training: Trained using bf16 precision and DeepSpeed ZeRO Stage 3 for optimized resource utilization.
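The group-relative baseline and KL penalty described above can be sketched in a few lines. This is a minimal illustration, not the actual training code: the function names are invented, and normalizing advantages by the group standard deviation follows the original GRPO formulation rather than anything stated on this card.

```python
import math

def group_relative_advantages(rewards, eps=1e-6):
    # Use the group mean reward as the baseline for each completion;
    # dividing by the group std (an assumption, per the original GRPO
    # paper) rescales advantages to a comparable range across groups.
    g = len(rewards)
    mean = sum(rewards) / g
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / g)
    return [(r - mean) / (std + eps) for r in rewards]

def kl_penalty(logp_policy, logp_ref):
    # Per-token k3 estimator of KL(policy || reference), commonly
    # paired with GRPO: exp(d) - d - 1 with d = logp_ref - logp_policy.
    # It is always non-negative and zero when the two logprobs match.
    d = logp_ref - logp_policy
    return math.exp(d) - d - 1
```

For a group of completions scored 0/1 by a math-answer checker, correct completions receive positive advantages and incorrect ones negative, so the policy gradient shifts probability mass toward the correct solutions while the KL term discourages drifting far from the reference model.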

Good For

  • Applications requiring robust mathematical problem-solving.
  • Research into reinforcement learning techniques for language models, particularly GRPO.
  • Developing agents that need to reason step-by-step through mathematical challenges.