thangvip/qwen2.5-1.5b-dspo-no-sft-sgd-linear
The thangvip/qwen2.5-1.5b-dspo-no-sft-sgd-linear model is a 1.5 billion parameter language model fine-tuned from Qwen/Qwen2.5-1.5B-Instruct. It was trained with the GRPO method introduced in the DeepSeekMath paper, which targets improved mathematical reasoning. The model is intended for tasks that demand stronger reasoning, particularly in mathematical contexts, and supports a context length of 131,072 tokens.
Overview
This model, thangvip/qwen2.5-1.5b-dspo-no-sft-sgd-linear, is a 1.5 billion parameter language model built on Qwen2.5-1.5B-Instruct. It was fine-tuned with GRPO (Group Relative Policy Optimization), the reinforcement learning method introduced in the DeepSeekMath paper. This training approach aims to improve the model's reasoning abilities, particularly on complex mathematical problems.
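The core idea behind GRPO can be sketched in a few lines: instead of learning a value-function baseline, it samples a group of completions per prompt and normalizes each completion's reward against the group's mean and standard deviation. The snippet below is an illustrative sketch of that advantage computation only, not the actual training code used for this model:

```python
# Illustrative sketch of GRPO's group-relative advantage (DeepSeekMath).
# For one prompt, several completions are sampled and scored; each score is
# normalized against the group's statistics instead of a learned baseline.
from statistics import mean, pstdev


def group_relative_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Normalize each completion's reward by its group's mean and std."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps guards against division by zero when all rewards are identical
    return [(r - mu) / (sigma + eps) for r in rewards]


# Example: four sampled answers to one math problem, scored 0/1 for correctness.
# Correct answers get positive advantage, incorrect ones negative.
advs = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Completions that score above the group average are reinforced and those below are discouraged, which is what removes the need for a separate critic model.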
Key Capabilities
- Enhanced Reasoning: Leverages the GRPO method for improved logical and mathematical reasoning.
- Large Context Window: Supports a context length of 131,072 tokens, allowing it to process very long inputs.
- TRL Framework: Developed with the TRL (Transformer Reinforcement Learning) library, indicating a reinforcement learning-based fine-tuning process.
Good for
- Mathematical Problem Solving: Ideal for applications requiring robust mathematical reasoning and problem-solving.
- Complex Reasoning Tasks: Suitable for scenarios where a model needs to follow intricate logical steps.
- Research and Development: Useful for exploring the impact of GRPO on smaller, instruction-tuned models.
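For the use cases above, the model can be run with the Hugging Face transformers library like any other Qwen2.5 chat model. The sketch below is a minimal, hypothetical usage example: the system prompt wording and generation settings are illustrative choices, not taken from the model card, and the heavy imports are deferred so the prompt helper stays dependency-free.

```python
MODEL_ID = "thangvip/qwen2.5-1.5b-dspo-no-sft-sgd-linear"


def build_messages(question: str) -> list[dict]:
    """Chat-format a question for the instruct-tuned model.

    The system prompt here is an illustrative choice, not prescribed
    by the model card.
    """
    return [
        {"role": "system", "content": "You are a helpful assistant. Reason step by step."},
        {"role": "user", "content": question},
    ]


def generate_answer(question: str, max_new_tokens: int = 512) -> str:
    """Load the model and generate a completion (downloads weights on first use)."""
    # Imported lazily so build_messages works without transformers installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID, torch_dtype="auto", device_map="auto"
    )
    prompt = tokenizer.apply_chat_template(
        build_messages(question), tokenize=False, add_generation_prompt=True
    )
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output = model.generate(**inputs, max_new_tokens=max_new_tokens)
    # Decode only the newly generated tokens, skipping the prompt.
    return tokenizer.decode(
        output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
    )
```

Given the long context window, `max_new_tokens` and the input length can be raised substantially for multi-step derivations.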