jaygala24/Qwen3-1.7B-GRPO-KL-math-reasoning
jaygala24/Qwen3-1.7B-GRPO-KL-math-reasoning is a 1.7 billion parameter Qwen3 model fine-tuned by jaygala24 for enhanced mathematical reasoning. It utilizes Group Relative Policy Optimization (GRPO) with a KL penalty, trained on GSM8K and MATH datasets. This model is specifically optimized to achieve high pass@k scores on complex math reasoning benchmarks, making it suitable for applications requiring robust arithmetic and logical problem-solving capabilities.
Loading preview...
Model Overview
This model, jaygala24/Qwen3-1.7B-GRPO-KL-math-reasoning, is a specialized fine-tuned version of the Qwen3-1.7B base model. Its primary distinction lies in its training methodology, employing Group Relative Policy Optimization (GRPO) with a KL penalty via the PipelineRL framework, specifically targeting mathematical reasoning tasks.
Key Capabilities & Training
- Mathematical Reasoning: Optimized for solving complex math problems, as evidenced by its strong performance on benchmarks.
- Reinforcement Learning Fine-tuning: Leverages GRPO with a KL coefficient of
0.001and a PPO policy loss, enhancing its ability to generate correct step-by-step reasoning. - Dataset Focus: Trained on
gsm8k_trainandmath_traindatasets, with evaluation ongsm8k_testandmath_500. - Performance: Achieves notable pass@k scores, including an 80.07% pass@1 on GSM8K (test) and 69.64% pass@1 on MATH-500, with overall pass@32 reaching 95.16% across both datasets.
- Technical Stack: Built using PipelineRL, Transformers, and DeepSpeed (ZeRO Stage 3) for efficient training.
Use Cases
This model is particularly well-suited for applications requiring accurate and detailed mathematical problem-solving. Developers should consider this model for:
- Automated Math Tutors: Generating step-by-step solutions for arithmetic and algebraic problems.
- Quantitative Analysis: Assisting in tasks that demand precise numerical reasoning.
- Educational Tools: Providing explanations and answers to mathematical queries.