jaygala24/Qwen3-4B-GRPO-KL-math-reasoning

Hugging Face
TEXT GENERATIONConcurrency Cost:1Model Size:4BQuant:BF16Ctx Length:32kPublished:Apr 6, 2026License:apache-2.0Architecture:Transformer Open Weights Warm

jaygala24/Qwen3-4B-GRPO-KL-math-reasoning is a fine-tuned version of the Qwen3-4B causal language model, specifically optimized for mathematical reasoning tasks. This model leverages Group Relative Policy Optimization (GRPO) with a KL penalty, trained on datasets like GSM8K and MATH-500. It demonstrates strong performance on mathematical benchmarks, achieving an overall pass@1 of 87.15% and pass@32 of 96.10% across GSM8K and MATH-500 datasets. Its primary strength lies in accurately solving complex math problems through step-by-step reasoning.

Loading preview...

jaygala24/Qwen3-4B-GRPO-KL-math-reasoning: Enhanced Mathematical Reasoning

This model is a specialized fine-tune of the Qwen3-4B base model, developed by jaygala24, focusing on advanced mathematical reasoning capabilities. It utilizes Group Relative Policy Optimization (GRPO) with a KL penalty, a reinforcement learning technique, to significantly improve its performance on complex math problems.

Key Capabilities & Training

  • Mathematical Reasoning: Specifically trained and optimized for solving mathematical problems, including arithmetic and word problems.
  • GRPO Fine-tuning: Employs GRPO with a KL coefficient of 0.001 and a policy loss of ppo for robust policy optimization.
  • Comprehensive Training Data: Fine-tuned on a combination of gsm8k_train and math_train datasets, ensuring exposure to a wide range of mathematical challenges.
  • High Sequence Length: Trained with a sequence length of 8192, allowing for processing longer problem descriptions and reasoning steps.

Performance Highlights

Evaluated on standard mathematical benchmarks, the model demonstrates strong results:

  • GSM8K (test): Achieves a pass@1 of 89.47% and pass@32 of 96.13%.
  • MATH-500: Achieves a pass@1 of 81.04% and pass@32 of 96.00%.
  • Overall: Boasts an impressive overall pass@1 of 87.15% and pass@32 of 96.10% across 1819 problems.

Ideal Use Cases

This model is particularly well-suited for applications requiring accurate and detailed step-by-step mathematical problem-solving, such as:

  • Educational tools for math assistance.
  • Automated problem solvers for quantitative tasks.
  • Research in improving LLM mathematical reasoning.