thangvip/qwen2.5-1.5b-grpo-no-sft-sgd-linear
thangvip/qwen2.5-1.5b-grpo-no-sft-sgd-linear is a 1.5 billion parameter causal language model, fine-tuned from Qwen/Qwen2.5-1.5B-Instruct. This model is trained with GRPO (Group Relative Policy Optimization), the reinforcement learning method introduced in DeepSeekMath, to strengthen its reasoning capabilities. With a context length of 32768 tokens, it is suited to tasks that benefit from extended, structured reasoning, particularly mathematical or logical problem-solving, building on its Qwen2.5 base architecture.
Model Overview
thangvip/qwen2.5-1.5b-grpo-no-sft-sgd-linear is a 1.5 billion parameter language model, fine-tuned from the Qwen/Qwen2.5-1.5B-Instruct base model. It leverages a notable training innovation: the GRPO (Group Relative Policy Optimization) method. This reinforcement learning technique was introduced in the DeepSeekMath paper to push the limits of mathematical reasoning in open language models.
Key Characteristics
- Base Model: Qwen2.5-1.5B-Instruct, providing a strong foundation for general language understanding and generation.
- Fine-tuning Method: Utilizes GRPO, a reinforcement learning approach, to enhance specific reasoning capabilities.
- Context Length: Supports a substantial context window of 32768 tokens, allowing for processing longer inputs and maintaining coherence over extended interactions.
- Training Framework: Trained using Hugging Face's TRL library, which provides the reinforcement learning training loop used for GRPO fine-tuning.
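A GRPO fine-tune along these lines can be sketched with TRL's `GRPOTrainer`. This is a hedged illustration, not the actual training recipe: the card does not publish the reward function or dataset, so the exact-match math reward and the GSM8K dataset below are illustrative assumptions (the `sgd`/`linear` settings merely echo hints in the repo name):

```python
# Sketch: GRPO fine-tuning with TRL (assumes trl >= 0.14, which ships GRPOTrainer).
# The reward function and dataset are illustrative placeholders, not the
# actual recipe used for this model.
import re


def exact_answer_reward(completions, answer, **kwargs):
    """Reward 1.0 when the completion's final number matches the reference.

    GSM8K-style references end with '#### <number>', so we take the text
    after the last '####' as the target.
    """
    rewards = []
    for completion, ref in zip(completions, answer):
        target = ref.split("####")[-1].strip()
        numbers = re.findall(r"-?\d+(?:\.\d+)?", completion)
        rewards.append(1.0 if numbers and numbers[-1] == target else 0.0)
    return rewards


def train():
    # Heavy imports kept inside the function so the reward logic above
    # stays testable without trl/datasets installed.
    from datasets import load_dataset
    from trl import GRPOConfig, GRPOTrainer

    config = GRPOConfig(
        output_dir="qwen2.5-1.5b-grpo",
        num_generations=8,           # completions sampled per prompt (the "group")
        max_completion_length=512,
        learning_rate=1e-6,
        optim="sgd",                 # assumption, echoing "sgd" in the repo name
        lr_scheduler_type="linear",  # assumption, echoing "linear" in the repo name
    )
    trainer = GRPOTrainer(
        model="Qwen/Qwen2.5-1.5B-Instruct",
        reward_funcs=exact_answer_reward,
        args=config,
        train_dataset=load_dataset("openai/gsm8k", "main", split="train"),
    )
    trainer.train()
```

The group-relative part of GRPO comes from `num_generations`: each prompt's sampled completions are scored by the reward function, and each completion's advantage is computed relative to its own group, removing the need for a separate value model.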
Potential Use Cases
This model is particularly well-suited for applications where improved reasoning, especially in areas like mathematics or complex problem-solving, is beneficial. Its GRPO fine-tuning suggests an advantage in tasks requiring more structured and logical thought processes compared to its base model.
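For such use cases, the model can be loaded like any Qwen2.5 checkpoint with the transformers library. A minimal inference sketch, assuming the standard Qwen2.5 chat template; the system prompt and sampling settings are illustrative choices, not values from this card:

```python
# Sketch: load the fine-tuned model from the Hub and answer a math question.
MODEL_ID = "thangvip/qwen2.5-1.5b-grpo-no-sft-sgd-linear"


def build_messages(question: str) -> list:
    """Wrap a user question in a chat message list for the chat template."""
    return [
        {"role": "system", "content": "You are a helpful math assistant."},
        {"role": "user", "content": question},
    ]


def generate_answer(question: str, max_new_tokens: int = 256) -> str:
    # Heavy imports kept here so the helper above is usable without transformers.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto")

    prompt = tokenizer.apply_chat_template(
        build_messages(question),
        tokenize=False,
        add_generation_prompt=True,
    )
    inputs = tokenizer(prompt, return_tensors="pt")
    outputs = model.generate(**inputs, max_new_tokens=max_new_tokens)
    # Decode only the newly generated tokens, not the echoed prompt.
    new_tokens = outputs[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)
```

For example, `generate_answer("What is 17 * 24?")` downloads the ~1.5B parameter weights on first use, so expect a few GB of disk and memory.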