thangvip/qwen3-1.7b-grpo-sft-base

Hugging Face
Text Generation · Concurrency Cost: 1 · Model Size: 2B · Quant: BF16 · Ctx Length: 32k · Published: Feb 2, 2026 · Architecture: Transformer

The thangvip/qwen3-1.7b-grpo-sft-base model is a 1.7 billion parameter language model developed by thangvip, fine-tuned from thangvip/qwen3-1.7b-base-sft-math-1500. It was trained with GRPO, the reinforcement learning method introduced in the DeepSeekMath paper, which specializes it for mathematical reasoning. The model is primarily intended for tasks that require robust mathematical problem-solving and logical deduction.


Overview

thangvip/qwen3-1.7b-grpo-sft-base is a 1.7 billion parameter language model, fine-tuned by thangvip. It builds upon the base model thangvip/qwen3-1.7b-base-sft-math-1500 and incorporates the GRPO (Group Relative Policy Optimization) training method. This method, detailed in the DeepSeekMath paper, is specifically designed to push the limits of mathematical reasoning in language models.
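The core idea of GRPO is to sample a group of completions per prompt and score each one relative to the others, using the group mean as the baseline instead of a learned value model. A minimal sketch of that group-relative advantage computation (the binary reward and `eps` stabilizer here are illustrative assumptions, not details from this model card):

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of sampled completions.

    GRPO replaces a learned value baseline with the group mean,
    scaling by the group standard deviation so advantages are
    comparable across prompts of varying difficulty.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Example: four completions sampled for one math problem, scored
# 1.0 if the final answer is correct and 0.0 otherwise.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Correct completions receive positive advantages and incorrect ones negative, and the advantages in each group sum to zero by construction.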

Key Capabilities

  • Enhanced Mathematical Reasoning: Optimized through GRPO for superior performance on mathematical tasks.
  • Fine-tuned from a Math-focused Base: Benefits from its origin as a math-specialized SFT model.
  • TRL Framework: Trained using the Transformer Reinforcement Learning (TRL) library.

Good for

  • Mathematical Problem Solving: Ideal for applications requiring accurate and robust mathematical reasoning.
  • Research in RLHF for Math: Useful for exploring and building upon GRPO-based training methodologies.
  • Developing Math-centric AI Assistants: Suitable as a foundation for agents focused on numerical and logical challenges.
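As a concrete starting point, a minimal inference sketch with the Hugging Face transformers library is shown below. The model card does not document a required prompt format, so the instruction-style prompt here is an assumption to adjust to taste:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "thangvip/qwen3-1.7b-grpo-sft-base"

def build_prompt(question: str) -> str:
    # Illustrative prompt template (assumption, not from the model card).
    return (
        "Solve the following problem step by step.\n\n"
        f"Problem: {question}\nSolution:"
    )

def solve(question: str, max_new_tokens: int = 256) -> str:
    # Downloads the ~1.7B-parameter BF16 checkpoint on first call.
    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto")
    inputs = tokenizer(build_prompt(question), return_tensors="pt")
    output = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(output[0], skip_special_tokens=True)

# Example (commented out because it downloads the full checkpoint):
# print(solve("What is 17 * 24?"))
```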