jaygala24/Qwen2.5-0.5B-GRPO-math-reasoning

Text Generation · Concurrency Cost: 1 · Model Size: 0.5B · Quant: BF16 · Ctx Length: 32k · Published: Apr 13, 2026 · License: apache-2.0 · Architecture: Transformer · Open Weights

jaygala24/Qwen2.5-0.5B-GRPO-math-reasoning is a 0.5 billion parameter Qwen2.5 model fine-tuned by jaygala24 using Group Relative Policy Optimization (GRPO) without a KL penalty. The model is optimized for mathematical reasoning: it was trained on GSM8K and MATH problems and evaluated on the GSM8K test set and MATH-500. It demonstrates strong performance on math reasoning benchmarks, making it suitable for applications that require step-by-step numerical problem solving.


Overview

This model, jaygala24/Qwen2.5-0.5B-GRPO-math-reasoning, is a specialized fine-tune of the Qwen2.5-0.5B base model. Developed by jaygala24, its core differentiator is the application of Group Relative Policy Optimization (GRPO) without a KL penalty for enhanced mathematical reasoning. The training utilized the PipelineRL framework and focused on datasets such as gsm8k_train and math_train.
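
As a quick illustration, the model can be loaded like any other Qwen2.5 checkpoint with the Hugging Face transformers library. The snippet below is a minimal usage sketch: the chat-style prompt, dtype, and generation settings are assumptions for illustration, not documented defaults for this fine-tune.

```python
# Minimal usage sketch (prompt wording and generation settings are illustrative).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "jaygala24/Qwen2.5-0.5B-GRPO-math-reasoning"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)

# A GSM8K-style word problem as an example prompt.
question = (
    "Natalia sold clips to 48 of her friends in April, and then half as many "
    "clips in May. How many clips did Natalia sell altogether?"
)
messages = [{"role": "user", "content": question}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=512)
# Decode only the newly generated tokens.
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```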

Key Capabilities

  • Mathematical Reasoning: Specifically fine-tuned to excel at solving mathematical problems, trained on the GSM8K and MATH problem sets and evaluated on GSM8K (test) and MATH-500.
  • GRPO Optimization: Leverages a reinforcement learning approach (GRPO with the group mean reward as the baseline) to improve policy performance on reasoning tasks; see the sketch after this list.
  • Compact Size: At 0.5 billion parameters, it offers a relatively small footprint while delivering competitive performance in its specialized domain.
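
For readers curious about the GRPO objective mentioned above, the snippet below sketches the group-relative advantage computation: several completions are sampled per prompt, and each completion's reward is baselined against the mean reward of its group. This is a simplified illustration under assumptions; the function name is hypothetical, and whether rewards are also divided by the group standard deviation is an implementation detail not stated in this card. Because no KL penalty is used here, these advantages would feed directly into the policy-gradient loss.

```python
import torch

def group_relative_advantages(rewards: torch.Tensor, normalize_std: bool = False) -> torch.Tensor:
    """Compute group-relative advantages for one prompt.

    rewards: shape (G,), the scalar reward of each of the G completions sampled
    for the same prompt. The group mean serves as the baseline; dividing by the
    group std is optional (not specified in this card).
    """
    baseline = rewards.mean()
    advantages = rewards - baseline
    if normalize_std:
        advantages = advantages / (rewards.std() + 1e-8)
    return advantages

# Example: 4 sampled answers to the same math problem, reward 1 if correct else 0.
rewards = torch.tensor([1.0, 0.0, 0.0, 1.0])
print(group_relative_advantages(rewards))  # tensor([ 0.5, -0.5, -0.5,  0.5])
```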

Evaluation Highlights

The model achieved notable pass@k scores on mathematical benchmarks (see the note after this list for how pass@k is typically estimated):

  • GSM8K (test): 51.77% pass@1, 89.76% pass@32
  • MATH-500: 31.18% pass@1, 73.00% pass@32
  • Overall: 46.11% pass@1, 85.16% pass@32
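
For context, pass@k is the probability that at least one of k sampled answers to a problem is correct. A common way to compute it is the unbiased estimator from the Codex paper, sketched below under the assumption that n ≥ k samples are drawn per problem; the exact sampling setup behind the numbers above is not documented in this card.

```python
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: n samples drawn per problem, c of them correct.

    pass@k = 1 - C(n - c, k) / C(n, k), i.e. one minus the probability that a
    randomly chosen size-k subset of the n samples contains no correct answer.
    """
    if n - c < k:
        return 1.0
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)

# Example: 32 samples per problem, 10 of them correct.
print(pass_at_k(32, 10, 1))   # 0.3125
print(pass_at_k(32, 10, 32))  # 1.0 (every size-32 subset contains a correct sample)
```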

Good for

  • Mathematical Problem Solving: Ideal for applications requiring accurate step-by-step mathematical reasoning.
  • Educational Tools: Can be integrated into systems designed to assist with or evaluate math homework and exercises.
  • Research in RL for Reasoning: Provides a practical example of GRPO's application in fine-tuning LLMs for specific cognitive tasks.