jaygala24/Qwen3-4B-GRPO-math-reasoning
Text generation · Model size: 4B · Quantization: BF16 · Context length: 32k · Published: Apr 6, 2026 · License: apache-2.0 · Architecture: Transformer (open weights)
jaygala24/Qwen3-4B-GRPO-math-reasoning is a 4 billion parameter model, fine-tuned from Qwen3-4B by jaygala24 using Group Relative Policy Optimization (GRPO) without a KL penalty. This model is specifically optimized for mathematical reasoning tasks, demonstrating strong performance on benchmarks like GSM8K and MATH-500. It is designed to excel in generating step-by-step mathematical solutions.
Model Overview
This model, jaygala24/Qwen3-4B-GRPO-math-reasoning, is a specialized fine-tune of the Qwen3-4B base model. It has been optimized for mathematical reasoning using Group Relative Policy Optimization (GRPO) without a KL penalty, leveraging the PipelineRL framework.
Key Capabilities & Training
- Mathematical Reasoning: Specifically trained on the `gsm8k_train` and `math_train` datasets to enhance its ability to solve mathematical problems.
- GRPO Optimization: Uses GRPO with a `ppo` policy loss and a KL coefficient of `0.0`, indicating a focus on direct policy improvement without a reference-model penalty.
- Performance: Achieves pass@1 scores of 89.11% on GSM8K (test) and 79.90% on MATH-500, with overall pass@32 reaching 95.66% across both datasets.
- Training Details: Trained for 1500 steps with a sequence length of 8192 and an effective batch size of 256, using DeepSpeed ZeRO Stage 3 for efficiency.
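The group-relative part of GRPO can be sketched in a few lines: for each prompt, a group of sampled completions is scored, and each completion's advantage is its reward normalized by the group's mean and standard deviation; with the KL coefficient set to `0.0`, as here, no reference-policy penalty is added. A minimal pure-Python sketch of that normalization (an illustration, not the PipelineRL implementation):

```python
import math

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward by its group's mean and std (GRPO-style).

    rewards: per-completion scalar rewards for one prompt's sample group.
    Returns one advantage per completion; with a KL coefficient of 0.0
    the policy loss uses these advantages directly, with no
    reference-model term.
    """
    n = len(rewards)
    mean = sum(rewards) / n
    var = sum((r - mean) ** 2 for r in rewards) / n
    std = math.sqrt(var)
    return [(r - mean) / (std + eps) for r in rewards]

# Example: 4 sampled solutions to one problem, binary correctness rewards.
# Correct samples get positive advantage, incorrect ones negative.
advs = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Because advantages are centered within each group, they always sum to (approximately) zero, so the update only shifts probability mass from worse completions toward better ones.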
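The pass@1 and pass@32 figures above are typically computed with the standard unbiased pass@k estimator: draw n samples per problem, count c correct, and estimate the probability that at least one of k samples is correct. The exact evaluation script is not stated in the card, so this is the conventional formula, not a confirmed detail:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator.

    n: total samples generated per problem.
    c: number of those samples that are correct.
    k: budget being evaluated (e.g. 1 or 32).
    Probability that at least one of k samples drawn without
    replacement from the n generations is correct.
    """
    if n - c < k:
        return 1.0  # too few incorrect samples to fill k slots
    return 1.0 - comb(n - c, k) / comb(n, k)

# e.g. with 32 samples of which 16 are correct, pass@1 estimates 0.5
```

Per-dataset scores are then the mean of this estimate over all problems.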
Good For
- Applications requiring accurate step-by-step mathematical problem-solving.
- Tasks involving arithmetic, algebra, and other quantitative reasoning.
- Developers looking for a Qwen3-4B variant optimized for numerical and logical deduction.
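Since the model emits step-by-step solutions, downstream code usually needs to pull out just the final answer for checking or display. A small helper for that is sketched below; the `\boxed{...}` convention and the fall-back to the last number are assumptions about the output format (common for MATH-style models, but not documented in this card), and actual generation would go through the standard `transformers` `AutoModelForCausalLM` API:

```python
import re

def extract_final_answer(solution: str):
    """Return the final answer from a step-by-step solution string.

    Prefers the last \\boxed{...} span (a common convention in
    MATH-style outputs), falling back to the last bare number in the
    text; returns None if neither is found.
    """
    boxed = re.findall(r"\\boxed\{([^{}]*)\}", solution)
    if boxed:
        return boxed[-1].strip()
    numbers = re.findall(r"-?\d+(?:\.\d+)?", solution)
    return numbers[-1] if numbers else None

# Example: a typical chain-of-thought answer ending in \boxed{17}.
print(extract_final_answer(
    "Step 1: 3*4 = 12. Step 2: 12 + 5 = 17. The answer is \\boxed{17}."
))  # → 17
```

Comparing extracted answers against references is also how exact-match accuracy on GSM8K/MATH-style benchmarks is usually scored.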