jindun/Qwen3-1.7B-GOPD-DeepMath
jindun/Qwen3-1.7B-GOPD-DeepMath is a 1.7 billion parameter Qwen3-based language model, fine-tuned with ExOPD (Extended Group Relative Policy Optimization) on the DeepMath-103K dataset. The model specializes in mathematical reasoning, particularly Olympiad-level problems, learning genuine reasoning skills through trial-and-error exploration rather than imitation. It offers a 32,768-token context length and is optimized for complex mathematical problem solving.
Overview
jindun/Qwen3-1.7B-GOPD-DeepMath is a 1.7 billion parameter model built on the Qwen3-1.7B base architecture. Its key differentiator is the fine-tuning process, which applies ExOPD (Extended Group Relative Policy Optimization) to the challenging DeepMath-103K dataset. This approach develops genuine mathematical reasoning skills through trial-and-error exploration, in contrast with traditional Supervised Fine-Tuning (SFT), which was observed to degrade performance.
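A minimal inference sketch using the Hugging Face `transformers` library is shown below. It assumes the model id on the Hub is `jindun/Qwen3-1.7B-GOPD-DeepMath` and that the tokenizer ships a chat template, as Qwen3 models typically do; the generation settings are illustrative, not prescribed by this card.

```python
# Minimal inference sketch; model id and generation settings are assumptions
# based on this card, not an official usage example.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "jindun/Qwen3-1.7B-GOPD-DeepMath"

def solve(problem: str, max_new_tokens: int = 512) -> str:
    """Generate a solution to a math problem with the fine-tuned model."""
    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID)
    # Format the problem with the model's chat template.
    messages = [{"role": "user", "content": problem}]
    prompt = tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    inputs = tokenizer(prompt, return_tensors="pt")
    outputs = model.generate(**inputs, max_new_tokens=max_new_tokens)
    # Decode only the newly generated tokens, skipping the prompt.
    new_tokens = outputs[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)

# Example call (downloads the model weights on first use):
# print(solve("Find all integers n such that n^2 + n + 1 divides n^3 - 1."))
```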
Key Capabilities
- Advanced Mathematical Reasoning: Specifically trained on a subset of DeepMath-103K containing 8,000 Olympiad-level problems (difficulty $\ge$ 6).
- Policy Optimization: Employs ExOPD, an algorithm combining Group Relative Policy Optimization (GRPO) with Rollout Correction, to learn robust reasoning strategies.
- Trial-and-Error Learning: Demonstrates that learning through exploration can be more effective for complex reasoning tasks than imitation learning; the SFT baseline showed a 16.67% performance degradation in comparison.
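The "group relative" part of GRPO scores each sampled solution against the other rollouts for the same problem: rewards are normalized by the group's mean and standard deviation, so no separate value model is needed. The sketch below illustrates only this normalization step (the Rollout Correction component of ExOPD is not shown, and the function name is my own):

```python
import statistics

def group_relative_advantages(rewards: list[float]) -> list[float]:
    """GRPO-style advantage: normalize each rollout's reward by the
    mean and (population) std of its group of rollouts."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards)
    if std == 0:
        # All rollouts scored identically: no learning signal.
        return [0.0 for _ in rewards]
    return [(r - mean) / std for r in rewards]

# Binary correctness rewards for 4 sampled solutions to one problem:
advantages = group_relative_advantages([1.0, 0.0, 1.0, 0.0])
# → [1.0, -1.0, 1.0, -1.0]: correct rollouts are reinforced,
#   incorrect ones penalized, relative to the group.
```

Because advantages are relative within each group, problems where every sample succeeds (or every sample fails) contribute no gradient, which concentrates learning on problems at the edge of the model's ability.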
Training Details
The model was trained for 3 epochs with a batch size of 256 and a learning rate of 1e-5, using Keven16/Qwen3-4B-Non-Thinking-RL-Math-Step500 as a teacher model during optimization.
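The reported hyperparameters could be captured in a configuration fragment like the one below. The field names follow common trainer conventions and are assumptions; only the values (epochs, batch size, learning rate, teacher model) come from this card.

```python
# Hypothetical config fragment mirroring the reported training setup;
# key names are illustrative, values are from the model card.
training_config = {
    "num_train_epochs": 3,
    "train_batch_size": 256,          # reported global batch size
    "learning_rate": 1e-5,
    "teacher_model": "Keven16/Qwen3-4B-Non-Thinking-RL-Math-Step500",
    "dataset": "DeepMath-103K",       # Olympiad-level subset, difficulty >= 6
}
```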
Good For
- Complex Mathematical Problem Solving: Ideal for applications requiring deep mathematical understanding and reasoning, especially for problems at an advanced difficulty level.
- Research in Reinforcement Learning for LLMs: Provides a case study on the effectiveness of policy optimization methods like ExOPD for enhancing reasoning capabilities in language models.