thangvip/qwen3-1.7b-grpo-sft-base
The thangvip/qwen3-1.7b-grpo-sft-base model is a 1.7-billion-parameter language model developed by thangvip and fine-tuned from thangvip/qwen3-1.7b-base-sft-math-1500. It is trained with the GRPO method introduced in the DeepSeekMath paper, which specializes it for mathematical reasoning. The model is primarily intended for tasks that require robust mathematical problem-solving and logical deduction.
Overview
thangvip/qwen3-1.7b-grpo-sft-base is a 1.7-billion-parameter language model fine-tuned by thangvip. It builds upon the base model thangvip/qwen3-1.7b-base-sft-math-1500 and incorporates the GRPO (Group Relative Policy Optimization) training method. GRPO, detailed in the DeepSeekMath paper, estimates advantages from groups of sampled completions rather than from a separate value model, and is specifically designed to push the limits of mathematical reasoning in language models.
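A minimal inference sketch using the Hugging Face transformers library, assuming the model ships with a standard tokenizer and config. The `build_math_prompt` helper and its \boxed{} convention are assumptions common to math-tuned models, not something documented by this card:

```python
# Inference sketch for thangvip/qwen3-1.7b-grpo-sft-base.
# The prompt wrapper below is an assumed convention, not an official API.

MODEL_ID = "thangvip/qwen3-1.7b-grpo-sft-base"

def build_math_prompt(question: str) -> str:
    # Ask for step-by-step reasoning and a \boxed{} final answer.
    return (
        f"{question}\n"
        "Please reason step by step and put the final answer in \\boxed{}."
    )

def generate_answer(question: str, max_new_tokens: int = 512) -> str:
    # Imported lazily so the prompt helper above stays dependency-free.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID)
    inputs = tokenizer(build_math_prompt(question), return_tensors="pt")
    outputs = model.generate(**inputs, max_new_tokens=max_new_tokens)
    # Decode only the newly generated tokens, not the echoed prompt.
    new_tokens = outputs[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)
```

Note that the first call to `generate_answer` downloads the model checkpoint from the Hub.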
Key Capabilities
- Enhanced Mathematical Reasoning: Optimized through GRPO for superior performance on mathematical tasks.
- Fine-tuned from a Math-focused Base: Benefits from its origin as a math-specialized SFT model.
- TRL Framework: Trained using the Transformer Reinforcement Learning (TRL) library.
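Since the card names TRL, a training run like this one can be sketched with TRL's `GRPOTrainer`. The dataset, reward function, and hyperparameters below are illustrative assumptions, not the author's actual recipe:

```python
# Hypothetical GRPO training sketch with TRL. The reward function,
# dataset choice, and hyperparameters are assumptions for illustration.

def boxed_format_reward(completions, **kwargs):
    # With prompt-only (standard-format) datasets, TRL passes completions
    # as plain strings; reward completions that contain a \boxed{} answer.
    return [1.0 if "\\boxed{" in completion else 0.0 for completion in completions]

def train():
    # Imported lazily so the reward function stays dependency-free.
    from datasets import load_dataset
    from trl import GRPOConfig, GRPOTrainer

    # Assumed dataset; GRPOTrainer expects a "prompt" column.
    dataset = load_dataset("openai/gsm8k", "main", split="train")
    dataset = dataset.rename_column("question", "prompt")

    args = GRPOConfig(output_dir="qwen3-1.7b-grpo", num_generations=8)
    trainer = GRPOTrainer(
        model="thangvip/qwen3-1.7b-base-sft-math-1500",  # the stated SFT base
        reward_funcs=boxed_format_reward,
        args=args,
        train_dataset=dataset,
    )
    trainer.train()
```

In practice a GRPO recipe for math would also include an accuracy reward that checks the extracted answer against the reference solution; the format reward here is only the simplest testable example.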
Good for
- Mathematical Problem Solving: Ideal for applications requiring accurate and robust mathematical reasoning.
- Research in RLHF for Math: Useful for exploring and building upon GRPO-based training methodologies.
- Developing Math-centric AI Assistants: Suitable as a foundation for agents focused on numerical and logical challenges.