thangvip/qwen3-1.7b-dspo-no-sft-sgd-linear-6500
The thangvip/qwen3-1.7b-dspo-no-sft-sgd-linear-6500 model is a fine-tuned version of Qwen/Qwen3-1.7B, developed by thangvip. This 1.7 billion parameter model was trained using the GRPO method, which is designed to enhance mathematical reasoning capabilities. It is particularly suited for tasks requiring advanced mathematical problem-solving, building upon the base Qwen3 architecture.
Loading preview...
Model Overview
This model, thangvip/qwen3-1.7b-dspo-no-sft-sgd-linear-6500, is a specialized fine-tuned variant of the Qwen/Qwen3-1.7B base model. It has been developed by thangvip and leverages the TRL (Transformers Reinforcement Learning) framework for its training process.
Key Training Details
- Fine-tuning Method: The model was trained using GRPO (Gradient-based Reward Policy Optimization). This method is notably introduced in the research paper "DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models" (arXiv:2402.03300).
- Frameworks Used: Key frameworks involved in its development include TRL (version 0.28.0.dev0), Transformers (version 4.57.6), PyTorch (version 2.9.0), Datasets (version 4.5.0), and Tokenizers (version 0.22.2).
Intended Use Cases
Given its training with the GRPO method, which focuses on mathematical reasoning, this model is likely optimized for:
- Mathematical problem-solving
- Reasoning tasks that benefit from enhanced logical and quantitative understanding.
Developers can quickly integrate this model using the provided transformers pipeline for text generation tasks.