jaygala24/Qwen3-1.7B-GRPO-math-reasoning
Text Generation · Concurrency cost: 1 · Model size: 2B · Quantization: BF16 · Context length: 32k · Published: Apr 6, 2026 · License: apache-2.0 · Architecture: Transformer · Open weights

jaygala24/Qwen3-1.7B-GRPO-math-reasoning is a 1.7 billion parameter language model, fine-tuned from Qwen3-1.7B using Group Relative Policy Optimization (GRPO) without a KL penalty. The model is optimized for mathematical reasoning, trained on datasets such as GSM8K and MATH. With a 32768-token context length, it is designed to produce step-by-step reasoning for complex mathematical problems.


Overview

This model, jaygala24/Qwen3-1.7B-GRPO-math-reasoning, is a specialized version of the Qwen3-1.7B base model, fine-tuned for stronger mathematical reasoning. It was trained with Group Relative Policy Optimization (GRPO), a reinforcement learning technique, with the KL penalty disabled, to improve its performance on math-related tasks.

Key Capabilities

  • Mathematical Reasoning: Optimized to process and solve mathematical problems, providing step-by-step reasoning.
  • GRPO Fine-tuning: Uses a specific RL algorithm (GRPO with a PPO-style policy loss and a KL coefficient of 0.0) for targeted skill development.
  • Extensive Training: Trained on a combination of gsm8k_train and math_train datasets, with evaluation on gsm8k_test and math_500.
  • Long-Sequence Training: Trained with a sequence length of 8192 tokens (with a 32768-token context at inference), allowing longer problem statements and extended chains of reasoning.
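The group-relative advantage at the heart of GRPO can be sketched in a few lines: the rewards for the G completions sampled for the same problem are normalized by the group's mean and standard deviation, with no learned value network and, in this model's case, no KL penalty term. This is a minimal illustration of the algorithm, not the card's actual training code.

```python
import statistics

def group_relative_advantages(rewards):
    """Compute GRPO-style advantages for one group of rollouts.

    Each completion sampled for the same problem gets an advantage equal
    to its reward normalized by the group mean and standard deviation; no
    value function is needed, and with a KL coefficient of 0.0 no
    reference-model penalty is added.
    """
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against zero spread
    return [(r - mean) / std for r in rewards]

# Training used 16 rollouts per problem; a toy group of 4 binary rewards:
print(group_relative_advantages([1.0, 0.0, 1.0, 0.0]))  # [1.0, -1.0, 1.0, -1.0]
```

Correct completions are pushed up and incorrect ones pushed down relative to their own group, which is what makes verifiable-reward tasks like GSM8K a natural fit.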

Good For

  • Solving Math Problems: Ideal for applications requiring accurate, reasoned solutions to mathematical queries.
  • Research in RL for LLMs: Demonstrates the application of GRPO for fine-tuning language models on specific cognitive tasks.
  • Educational Tools: Can be integrated into systems that assist with learning or checking mathematical work.
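For answer checking, a common approach with GSM8K-style outputs is to extract the final numeric answer and compare it exactly. The `#### <answer>` marker and the last-number fallback below are conventions from GSM8K evaluation harnesses, assumed here rather than guaranteed by this model card.

```python
import re

def extract_final_answer(text: str):
    """Pull the final numeric answer from a GSM8K-style solution.

    GSM8K reference solutions end with '#### <answer>'; when the marker
    is absent, fall back to the last number in the text. Both conventions
    are assumptions about the output format.
    """
    marker = re.search(r"####\s*(-?[\d,\.]+)", text)
    if marker:
        return marker.group(1).replace(",", "")
    numbers = re.findall(r"-?\d[\d,]*\.?\d*", text)
    return numbers[-1].replace(",", "") if numbers else None

def is_correct(prediction: str, reference: str) -> bool:
    # Exact match on the extracted answers.
    return extract_final_answer(prediction) == extract_final_answer(reference)

print(is_correct("She sold 48 + 24 = 72 clips.\n#### 72", "#### 72"))  # True
```

The same check doubles as a binary reward function for RL pipelines of this kind.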

Training Details

The model was trained with a learning rate of 1e-06 over 1500 steps, using bf16 precision and DeepSpeed ZeRO Stage 3 for efficiency. The training involved 16 rollouts per problem and an effective batch size of 256.
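The reported hyperparameters can be collected in one place; the problems-per-step figure below is derived, assuming the effective batch size counts sampled completions rather than unique problems.

```python
# Training hyperparameters as reported on this card.
train_config = {
    "base_model": "Qwen3-1.7B",
    "algorithm": "GRPO (PPO-style policy loss, KL coefficient 0.0)",
    "learning_rate": 1e-6,
    "total_steps": 1500,
    "precision": "bf16",
    "parallelism": "DeepSpeed ZeRO Stage 3",
    "train_seq_length": 8192,
    "rollouts_per_problem": 16,
    "effective_batch_size": 256,
}

# Derived: 256 completions per step at 16 rollouts per problem implies
# 16 unique problems per optimizer step (an inference, not a stated value).
problems_per_step = (
    train_config["effective_batch_size"] // train_config["rollouts_per_problem"]
)
print(problems_per_step)  # 16
```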