thangvip/qwen3-1.7b-dspo-no-sft-sgd-linear
The thangvip/qwen3-1.7b-dspo-no-sft-sgd-linear model is a 2 billion parameter language model, fine-tuned from Qwen/Qwen3-1.7B. It utilizes the GRPO training method, as introduced in the DeepSeekMath paper, for enhanced performance. This model is specifically adapted for tasks requiring advanced reasoning, leveraging its specialized training approach. Its 40960-token context length supports processing extensive inputs for complex problem-solving.
Loading preview...
Model Overview
thangvip/qwen3-1.7b-dspo-no-sft-sgd-linear is a 2 billion parameter language model, fine-tuned from the base Qwen/Qwen3-1.7B architecture. This model distinguishes itself through its training methodology, employing the GRPO (Generalized Reinforcement Learning with Policy Optimization) technique. GRPO is a method highlighted in the research paper "DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models" (arXiv:2402.03300), suggesting an optimization for reasoning-intensive tasks.
Key Characteristics
- Base Model: Qwen3-1.7B, a robust foundation for language understanding.
- Training Method: Fine-tuned using GRPO, a specialized technique for improving reasoning capabilities, implemented via the TRL library.
- Context Length: Supports a substantial context window of 40960 tokens, enabling the processing of longer and more complex inputs.
Potential Use Cases
Given its GRPO-based training, this model is likely well-suited for applications demanding:
- Complex Reasoning: Tasks that require logical deduction, problem-solving, or structured thinking.
- Mathematical Applications: While not explicitly stated as a math model, its training method's origin in DeepSeekMath suggests potential benefits for mathematical reasoning tasks.
- Advanced Language Understanding: Leveraging its large context window for nuanced comprehension of extensive texts.