Overview
This model, Shreyansh327/Qwen3-1.7B-grpo-gsm8k, is a 1.7 billion parameter Qwen3-1.7B base model that has been fine-tuned using Group Relative Policy Optimization (GRPO). Its primary focus is mathematical reasoning, specifically the grade-school level math problems found in the GSM8K dataset.
Key Capabilities
- Structured Chain-of-Thought Reasoning: The model is trained to produce detailed, step-by-step reasoning enclosed within <think>...</think> tags, enhancing the transparency and interpretability of its solutions.
- Mathematical Problem Solving: Optimized for accuracy on numerical and word problems, particularly those similar to the GSM8K dataset.
- Reward-Based Training: Utilizes a multi-component reward system during training, combining rewards for correct formatting and answer accuracy with a penalty for excessive verbosity.
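The card does not publish the exact reward functions or weights, so the following is only an illustrative sketch of how a multi-component GRPO reward of this kind (format reward, accuracy reward, verbosity penalty) might look; all function names, weights, and the character budget are assumptions.

```python
import re

THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)

def format_reward(completion: str) -> float:
    """Reward completions that wrap their reasoning in <think>...</think> tags."""
    return 1.0 if THINK_RE.search(completion) else 0.0

def accuracy_reward(completion: str, reference_answer: str) -> float:
    """Reward completions whose final number matches the reference answer."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", completion)
    return 2.0 if numbers and numbers[-1] == reference_answer else 0.0

def length_penalty(completion: str, max_chars: int = 2000) -> float:
    """Penalize verbosity beyond a character budget (budget is an assumed value)."""
    overflow = max(0, len(completion) - max_chars)
    return -0.001 * overflow

def total_reward(completion: str, reference_answer: str) -> float:
    """Sum the components; the actual model may weight them differently."""
    return (format_reward(completion)
            + accuracy_reward(completion, reference_answer)
            + length_penalty(completion))
```

In TRL, each component would typically be passed as a separate entry in `reward_funcs`, letting the trainer log them individually.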
Training Details
The model was trained using the TRL GRPOTrainer with LoRA (Rank 16, Alpha 32) on the openai/gsm8k training split. It underwent 200 training steps with a cosine learning rate schedule. The training infrastructure included 2x NVIDIA H100 80GB GPUs.
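A minimal sketch of the setup described above, assuming TRL's `GRPOTrainer` with a PEFT `LoraConfig`; only the stated details (rank 16, alpha 32, 200 steps, cosine schedule, the openai/gsm8k train split) come from the card, and everything else here is an illustrative placeholder, not the author's actual script.

```python
from datasets import load_dataset
from peft import LoraConfig
from trl import GRPOConfig, GRPOTrainer

# GSM8K training split, as stated in the card
dataset = load_dataset("openai/gsm8k", "main", split="train")

# LoRA adapter: rank 16, alpha 32 (from the card)
peft_config = LoraConfig(r=16, lora_alpha=32, task_type="CAUSAL_LM")

training_args = GRPOConfig(
    output_dir="Qwen3-1.7B-grpo-gsm8k",
    max_steps=200,               # from the card
    lr_scheduler_type="cosine",  # from the card
)

trainer = GRPOTrainer(
    model="Qwen/Qwen3-1.7B",  # base model on the Hub
    reward_funcs=[...],       # format, accuracy, and verbosity rewards
    args=training_args,
    train_dataset=dataset,
    peft_config=peft_config,
)
trainer.train()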
Intended Use
This model is specifically designed for math reasoning tasks, especially those requiring explicit, structured thinking steps. It is best suited for applications where understanding the reasoning process behind a mathematical answer is as important as the answer itself.
Limitations
- Domain Specificity: Primarily trained on GSM8K, its performance may not generalize well to other mathematical domains (e.g., algebra, calculus) or general-purpose tasks.
- Verbosity: The chain-of-thought reasoning can sometimes be verbose, and the model may over-verify its answers.
- Format Dependency: Relies on a specific <think>...</think> tag format for its structured output.
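Because of this format dependency, downstream code should parse the tags explicitly rather than treating the raw completion as the answer. A minimal sketch (the function name is hypothetical) that splits a completion into its reasoning trace and final answer:

```python
import re

def split_reasoning(output: str) -> tuple[str, str]:
    """Split a completion into (reasoning, answer) based on <think>...</think> tags."""
    match = re.search(r"<think>(.*?)</think>", output, re.DOTALL)
    if match is None:
        # No structured reasoning found; treat the whole output as the answer.
        return "", output.strip()
    reasoning = match.group(1).strip()
    answer = output[match.end():].strip()
    return reasoning, answer
```

Handling the no-match case matters in practice, since the model may occasionally omit the tags.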