jaygala24/Qwen3-1.7B-GRPO-math-reasoning is a 1.7 billion parameter language model, fine-tuned from Qwen3-1.7B using Group Relative Policy Optimization (GRPO) without a KL penalty. This model is specifically optimized for mathematical reasoning tasks, leveraging datasets like GSM8K and MATH. With a 32768-token context length, it is designed to produce step-by-step reasoning for complex mathematical problems.
Overview
This model, jaygala24/Qwen3-1.7B-GRPO-math-reasoning, is a specialized version of the Qwen3-1.7B base model, fine-tuned for enhanced mathematical reasoning. It was trained with Group Relative Policy Optimization (GRPO), a reinforcement learning technique, applied here without a KL penalty to improve performance on math-related tasks.
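The card does not include usage code; below is a minimal sketch using the standard Hugging Face `transformers` chat API. The prompt wording, generation settings, and helper names are illustrative, not part of the model's documented interface.

```python
MODEL_ID = "jaygala24/Qwen3-1.7B-GRPO-math-reasoning"


def build_messages(problem: str) -> list[dict]:
    # Ask explicitly for step-by-step reasoning, which the model was tuned to produce.
    return [{"role": "user", "content": f"{problem}\nPlease reason step by step."}]


def solve(problem: str, max_new_tokens: int = 1024) -> str:
    # transformers is imported lazily so the prompt helper stays dependency-free.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto")
    inputs = tokenizer.apply_chat_template(
        build_messages(problem), add_generation_prompt=True, return_tensors="pt"
    )
    outputs = model.generate(inputs, max_new_tokens=max_new_tokens)
    # Decode only the newly generated tokens, skipping the prompt.
    return tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True)


if __name__ == "__main__":
    print(solve("A train travels 60 miles in 1.5 hours. What is its average speed?"))
```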
Key Capabilities
- Mathematical Reasoning: Optimized to process and solve mathematical problems, providing step-by-step reasoning.
- GRPO Fine-tuning: Leverages a specific RL algorithm (GRPO with a PPO-style policy loss and a KL coefficient of 0.0) for targeted skill development.
- Extensive Training: Trained on a combination of the `gsm8k_train` and `math_train` datasets, with evaluation on `gsm8k_test` and `math_500`.
- High Context Length: Supports a sequence length of 8192 tokens during training, indicating potential for handling longer problem descriptions.
Good For
- Solving Math Problems: Ideal for applications requiring accurate, reasoned solutions to mathematical queries.
- Research in RL for LLMs: Demonstrates the application of GRPO for fine-tuning language models on specific cognitive tasks.
- Educational Tools: Can be integrated into systems that assist with learning or checking mathematical work.
Training Details
The model was trained with a learning rate of 1e-06 over 1500 steps, using bf16 precision and DeepSpeed ZeRO Stage 3 for efficiency. The training involved 16 rollouts per problem and an effective batch size of 256.
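The group-relative step at the heart of GRPO can be sketched in a few lines: each problem's rollouts form one group, and every rollout's reward is normalized against its own group's mean and standard deviation, so no separate value network is needed. This is a simplified illustration of the advantage computation, not the actual training code:

```python
import statistics


def grpo_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Group-relative advantages: normalize each rollout's reward
    against the statistics of its own group (one group per problem)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# e.g. 16 rollouts for one problem, reward 1.0 when the final answer is correct
rewards = [1.0] * 4 + [0.0] * 12
advantages = grpo_advantages(rewards)
```

Rollouts that beat the group average get positive advantages and are reinforced; with the KL coefficient set to 0.0, no penalty pulls the policy back toward the reference model.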