jaygala24/Qwen3-1.7B-GRPO-math-reasoning
The jaygala24/Qwen3-1.7B-GRPO-math-reasoning model is a fine-tuned version of Qwen3-1.7B, specifically optimized for mathematical reasoning tasks. It was trained using GRPO (Group Relative Policy Optimization) without a KL penalty on GSM8K and MATH datasets. This model demonstrates strong performance on math reasoning benchmarks, achieving an overall pass@32 of 95.05% across GSM8K and MATH-500 datasets. It is designed for applications requiring accurate step-by-step mathematical problem-solving.
Overview
This model, jaygala24/Qwen3-1.7B-GRPO-math-reasoning, is a specialized fine-tuned variant of the Qwen3-1.7B base model. Its primary focus is on enhancing mathematical reasoning capabilities through reinforcement-learning fine-tuning.
Key Capabilities
- Mathematical Reasoning: Specifically optimized for solving mathematical problems, as evidenced by its training on the `gsm8k_train` and `math_train` datasets.
- GRPO Training: Utilizes Group Relative Policy Optimization (GRPO) without a KL penalty, a reinforcement learning technique, to refine its reasoning process.
- Strong Benchmark Performance: Achieves notable results on math reasoning benchmarks:
- GSM8K (test): 79.73% pass@1, 95.38% pass@32
- MATH-500: 69.84% pass@1, 94.20% pass@32
- Overall: 77.01% pass@1, 95.05% pass@32 across 1819 problems.
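The card does not say how pass@1 and pass@32 were computed; the standard approach is the unbiased pass@k estimator of Chen et al. (2021), sketched below for reference (the sampling setup used for this model is an assumption, not documented here):

```python
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: probability that at least one of k
    samples, drawn without replacement from n generations of which c are
    correct, solves the problem. Equals 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        # Fewer incorrect samples than k draws: a correct one is guaranteed.
        return 1.0
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)
```

For example, with 2 generations of which 1 is correct, `pass_at_k(2, 1, 1)` gives 0.5; averaging this estimate over all problems yields the benchmark-level pass@k figures reported above.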
Training Details
The model was trained using PipelineRL with specific hyperparameters including a learning rate of 1e-06, a sequence length of 8192, and bf16 precision, leveraging DeepSpeed ZeRO Stage 3.
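The full training configuration is not published with the card; a minimal sketch of how the stated hyperparameters could map onto a DeepSpeed ZeRO Stage 3 setup follows. Only the learning rate, sequence length, bf16 precision, and ZeRO stage come from the card; every other key (including the optimizer choice) is an illustrative assumption:

```python
# Hypothetical config sketch. Stated values: lr 1e-06, sequence length 8192,
# bf16 precision, DeepSpeed ZeRO Stage 3. Everything else is an assumption.
deepspeed_config = {
    "bf16": {"enabled": True},          # stated: bf16 precision
    "zero_optimization": {"stage": 3},  # stated: DeepSpeed ZeRO Stage 3
    "optimizer": {
        "type": "AdamW",                # assumption: optimizer not stated
        "params": {"lr": 1e-6},         # stated: learning rate 1e-06
    },
}

training_args = {
    "max_seq_length": 8192,             # stated: sequence length
}
```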
When to Use This Model
This model is ideal for use cases requiring robust and accurate mathematical problem-solving, particularly those involving step-by-step reasoning. Its fine-tuning on dedicated math datasets makes it a strong candidate for educational tools, automated problem solvers, or any application where precise numerical and logical deduction is critical.
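A minimal usage sketch with the Hugging Face `transformers` library is shown below. The prompt template and generation settings are assumptions (the card does not specify a prompt format), so treat this as a starting point rather than the documented interface:

```python
MODEL_ID = "jaygala24/Qwen3-1.7B-GRPO-math-reasoning"

def build_prompt(question: str) -> str:
    # Assumed instruction-style prompt asking for step-by-step reasoning;
    # the model card does not document an official template.
    return (
        "Solve the following problem step by step.\n\n"
        f"Problem: {question}\nSolution:"
    )

def solve(question: str, max_new_tokens: int = 512) -> str:
    # Imports kept local so the prompt helper above is usable without
    # transformers/torch installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="bfloat16")
    inputs = tokenizer(build_prompt(question), return_tensors="pt")
    output = model.generate(**inputs, max_new_tokens=max_new_tokens)
    # Decode only the newly generated tokens, skipping the prompt.
    return tokenizer.decode(
        output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
    )
```

Because the model was trained for step-by-step reasoning, prompting it to show its work (as above) is likely to match its training distribution better than asking for a bare final answer.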