thangvip/qwen3-1.7b-dspo-no-sft-exp2

Hugging Face
Text Generation · Model Size: 2B · Quant: BF16 · Context Length: 32k · Published: Feb 3, 2026 · Architecture: Transformer

thangvip/qwen3-1.7b-dspo-no-sft-exp2 is a 1.7-billion-parameter language model fine-tuned from Qwen/Qwen3-1.7B. It was trained with GRPO, the reinforcement-learning method introduced in the DeepSeekMath paper for improving mathematical reasoning. The model is intended for tasks that benefit from stronger step-by-step reasoning.


Model Overview

This model, thangvip/qwen3-1.7b-dspo-no-sft-exp2, is a 1.7-billion-parameter language model derived from the Qwen3-1.7B architecture. It was fine-tuned with GRPO (Group Relative Policy Optimization), a reinforcement-learning technique introduced in the DeepSeekMath research paper that estimates advantages from groups of sampled completions rather than from a separate value model. This training approach aims to enhance the model's reasoning abilities, particularly in multi-step problem solving.
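Since the checkpoint follows the standard Qwen3 chat format, it can be loaded with the Hugging Face transformers library. A minimal inference sketch, assuming a chat-style prompt and illustrative generation settings (the card does not document recommended sampling parameters):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "thangvip/qwen3-1.7b-dspo-no-sft-exp2"


def build_messages(question: str) -> list[dict]:
    # Single-turn chat payload in the format apply_chat_template expects.
    return [{"role": "user", "content": question}]


def main() -> None:
    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    # BF16 matches the quantization listed on the card.
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype=torch.bfloat16)

    input_ids = tokenizer.apply_chat_template(
        build_messages("If 3x + 5 = 20, what is x?"),
        add_generation_prompt=True,
        return_tensors="pt",
    )
    outputs = model.generate(input_ids, max_new_tokens=256)
    # Decode only the newly generated tokens, not the prompt.
    print(tokenizer.decode(outputs[0][input_ids.shape[-1]:], skip_special_tokens=True))


if __name__ == "__main__":
    main()
```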

Key Capabilities

  • Enhanced Reasoning: Leverages the GRPO training method to improve logical and analytical processing.
  • Qwen3-1.7B Base: Built upon the robust Qwen3-1.7B foundation, providing a strong general language understanding.
  • TRL Framework: Developed with the TRL (Transformer Reinforcement Learning) library, Hugging Face's toolkit for post-training language models with reinforcement-learning methods such as GRPO.

Training Details

The model's training procedure used GRPO, as detailed in the paper "DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models" (arXiv:2402.03300). This indicates an emphasis on tasks that demand structured, multi-step problem solving, such as mathematical or scientific reasoning. Training was conducted with TRL 0.28.0.dev0, Transformers 4.57.6, and PyTorch 2.9.0.