jaygala24/Qwen2.5-1.5B-GRPO-KL-math-reasoning

Text generation · Concurrency cost: 1 · Model size: 1.5B · Quantization: BF16 · Context length: 32K · Published: Apr 13, 2026 · License: apache-2.0 · Architecture: Transformer · Open weights

The jaygala24/Qwen2.5-1.5B-GRPO-KL-math-reasoning model is a 1.5 billion parameter Qwen2.5-based language model, fine-tuned using Group Relative Policy Optimization (GRPO) with a KL penalty. It is specifically optimized for mathematical reasoning tasks, demonstrating strong performance on benchmarks like GSM8K and MATH-500. This model is designed for applications requiring robust step-by-step mathematical problem-solving capabilities within a 32K context length.


Model Overview

This model, jaygala24/Qwen2.5-1.5B-GRPO-KL-math-reasoning, is a specialized fine-tune of the Qwen2.5-1.5B base model. Its primary distinction lies in its training methodology: it leverages Group Relative Policy Optimization (GRPO) with a KL penalty for enhanced mathematical reasoning.

Key Capabilities & Training

  • Mathematical Reasoning: Specifically optimized for solving mathematical problems, as evidenced by its training on gsm8k_train and math_train datasets.
  • GRPO Algorithm: Utilizes Group Relative Policy Optimization, a reinforcement learning technique that uses the mean reward of a group of sampled completions as the baseline for computing relative advantages, removing the need for a separate critic model; the KL penalty keeps the fine-tuned policy close to the reference model.
  • Performance: Achieves notable results on mathematical benchmarks:
    • GSM8K (test): 75.35 pass@1, 96.13 pass@32
    • MATH-500: 54.79 pass@1, 87.20 pass@32
  • Context Length: Trained with sequences up to 8192 tokens; the underlying Qwen2.5 architecture supports a 32K context window, making the model suitable for problems requiring substantial context.
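The group-relative baseline described above can be sketched in a few lines. This is not the model's released training code; it is a minimal illustration where the rewards, the group size, and the 0/1 correctness reward scheme are all assumptions:

```python
import statistics

def group_relative_advantages(rewards):
    # GRPO-style advantage: subtract the group mean reward and scale
    # by the group standard deviation, so each completion is scored
    # relative to the other completions sampled for the same prompt.
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # avoid division by zero
    return [(r - mean) / std for r in rewards]

# Four completions sampled for one math problem, rewarded 1.0 for a
# correct final answer and 0.0 otherwise (hypothetical reward scheme):
advantages = group_relative_advantages([1.0, 0.0, 1.0, 0.0])
# correct completions receive positive advantage, incorrect negative
```

Because the baseline comes from the group itself, no separate value model is needed; the advantages then weight the policy-gradient update alongside the KL penalty.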

Use Cases

This model is particularly well-suited for applications demanding accurate and step-by-step mathematical problem-solving, especially in scenarios where a smaller, specialized model is preferred over larger, general-purpose LLMs.