jaygala24/Qwen3-4B-GRPO-math-reasoning

Text Generation · Model size: 4B · Quantization: BF16 · Context length: 32k · Published: Apr 6, 2026 · License: apache-2.0 · Architecture: Transformer (open weights)

jaygala24/Qwen3-4B-GRPO-math-reasoning is a 4-billion-parameter model fine-tuned from Qwen3-4B by jaygala24 using Group Relative Policy Optimization (GRPO) without a KL penalty. It is optimized for mathematical reasoning, generating step-by-step solutions and achieving strong results on benchmarks such as GSM8K and MATH-500.


Model Overview

This model, jaygala24/Qwen3-4B-GRPO-math-reasoning, is a specialized fine-tune of the Qwen3-4B base model. It has been optimized for mathematical reasoning using Group Relative Policy Optimization (GRPO) without a KL penalty, leveraging the PipelineRL framework.
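The distinguishing feature of GRPO is that it replaces a learned value model with group-relative advantages: each prompt is sampled several times, and each completion's reward is normalized against the mean and standard deviation of its own group. A minimal sketch of that advantage computation (the exact normalization and epsilon are assumptions; this is the commonly used formulation, not this repo's verified training code):

```python
from statistics import mean, stdev

def grpo_advantages(rewards, eps=1e-6):
    """Group-relative advantages: normalize each sampled completion's
    reward by the mean and (sample) std of its group, so no separate
    value model is needed."""
    mu = mean(rewards)
    sigma = stdev(rewards) if len(rewards) > 1 else 0.0
    return [(r - mu) / (sigma + eps) for r in rewards]

# A group of 4 sampled solutions to one math problem, scored 1.0 if
# the final answer is correct and 0.0 otherwise.
advs = grpo_advantages([1.0, 0.0, 1.0, 0.0])
# Correct samples get positive advantage, incorrect ones negative.
```

With the KL coefficient set to 0.0 (as in this fine-tune), these advantages feed directly into the PPO-style policy loss with no penalty term pulling the policy back toward the reference model.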

Key Capabilities & Training

  • Mathematical Reasoning: Specifically trained on gsm8k_train and math_train datasets to enhance its ability to solve mathematical problems.
  • GRPO Optimization: Uses GRPO with a PPO policy loss and a KL coefficient of 0.0 (i.e., no KL penalty against the reference model), favoring direct policy improvement.
  • Performance: Achieves notable pass@1 scores of 89.11% on GSM8K (test) and 79.90% on MATH-500, with overall pass@32 reaching 95.66% across both datasets.
  • Training Details: Trained for 1500 steps with a sequence length of 8192 and an effective batch size of 256, using DeepSpeed ZeRO Stage 3 for efficiency.
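The pass@k figures above are typically computed with the standard unbiased estimator (1 − C(n−c, k)/C(n, k), where n samples are drawn per problem and c are correct). A minimal sketch, assuming this model card uses that standard estimator rather than a custom evaluation script:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k completions,
    drawn without replacement from n samples of which c are correct,
    solves the problem."""
    if n - c < k:
        # Fewer incorrect samples than k draws: a correct one is guaranteed.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Illustrative numbers (not from the card): 32 samples, 20 correct.
p1 = pass_at_k(32, 20, 1)    # reduces to the raw accuracy 20/32
p32 = pass_at_k(32, 20, 32)  # all samples drawn, so this is 1.0
```

Per-problem values are then averaged over the benchmark to produce the reported percentages.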

Good For

  • Applications requiring accurate step-by-step mathematical problem-solving.
  • Tasks involving arithmetic, algebra, and other quantitative reasoning.
  • Developers looking for a Qwen3-4B variant optimized for numerical and logical deduction.