AbbottYang/Qwen2-0.5B-GRPO-test
AbbottYang/Qwen2-0.5B-GRPO-test is a 0.5-billion-parameter causal language model, fine-tuned from Qwen/Qwen2-0.5B-Instruct. It was trained with GRPO (Group Relative Policy Optimization), the method introduced in the DeepSeekMath paper for enhancing mathematical reasoning. The model supports a 32,768-token context length and targets tasks that benefit from stronger reasoning.
Model Overview
AbbottYang/Qwen2-0.5B-GRPO-test is a 0.5-billion-parameter language model, fine-tuned from the base Qwen/Qwen2-0.5B-Instruct model, with a 32,768-token context window.
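A minimal loading sketch with the Transformers library is shown below; it assumes the checkpoint keeps the standard Qwen2 configuration, which is where the 32,768-token context window comes from.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "AbbottYang/Qwen2-0.5B-GRPO-test"

# Download the tokenizer and weights from the Hugging Face Hub.
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# The 32,768-token context window stated on this card should be visible in
# the config (assuming the checkpoint keeps the standard Qwen2 settings).
print(model.config.max_position_embeddings)  # expected: 32768
```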
Key Differentiator: GRPO Training
This model's primary distinction is its training methodology. It was fine-tuned with GRPO (Group Relative Policy Optimization), the reinforcement-learning method detailed in the paper "DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models" (arXiv:2402.03300). GRPO is designed to enhance a model's reasoning capabilities, particularly in mathematical contexts.
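In brief, GRPO forgoes the separate value model used in PPO-style RLHF. For each prompt it samples a group of G completions, scores each with a reward, and normalizes the rewards within the group to obtain per-completion advantages, as defined in the DeepSeekMath paper:

$$
\hat{A}_i = \frac{r_i - \operatorname{mean}(\{r_1, \ldots, r_G\})}{\operatorname{std}(\{r_1, \ldots, r_G\})}
$$

Completions scoring above their group's mean are reinforced and those below are discouraged, which is the mechanism behind the reasoning gains reported in the paper.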
Training Framework
The fine-tuning was performed with the TRL library, using the following versions:
- TRL: 0.29.0
- Transformers: 5.3.0
- PyTorch: 2.7.0+cu128
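The card does not include the training script, but a minimal sketch of a GRPO fine-tune with TRL's GRPOTrainer might look like the following. The dataset and the length-based reward function are illustrative placeholders from the TRL documentation, not the ones used for this model.

```python
from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer

# Illustrative prompt dataset; the actual training data for this model
# is not documented on the card.
dataset = load_dataset("trl-lib/tldr", split="train")

# Toy reward function: GRPOTrainer accepts arbitrary Python callables that
# score a batch of completions. A real setup for this model would instead
# reward correct mathematical answers.
def reward_len(completions, **kwargs):
    return [-abs(20 - len(completion)) for completion in completions]

training_args = GRPOConfig(output_dir="Qwen2-0.5B-GRPO-test", logging_steps=10)
trainer = GRPOTrainer(
    model="Qwen/Qwen2-0.5B-Instruct",  # base model named on this card
    reward_funcs=reward_len,
    args=training_args,
    train_dataset=dataset,
)
trainer.train()
```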
Potential Use Cases
Given its GRPO-based training, this model is potentially suitable for:
- Tasks requiring improved logical reasoning.
- Applications where mathematical problem-solving is a component (see the usage sketch after this list).
- As a base for further fine-tuning on specific reasoning-intensive datasets.
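A hypothetical usage sketch for the math-reasoning case, using the Transformers text-generation pipeline's chat support; the prompt and generation length are illustrative choices, not values from this card.

```python
from transformers import pipeline

# Recent Transformers versions apply the model's chat template
# automatically when the pipeline is given a list of messages.
pipe = pipeline("text-generation", model="AbbottYang/Qwen2-0.5B-GRPO-test")

messages = [
    {
        "role": "user",
        "content": "A train travels 120 km in 1.5 hours. "
                   "What is its average speed in km/h? Think step by step.",
    }
]

result = pipe(messages, max_new_tokens=256)
# The pipeline returns the full conversation; the last message is the reply.
print(result[0]["generated_text"][-1]["content"])
```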