xiwenc1/OpenRS-DR_GRPO_dra-qwen2
The xiwenc1/OpenRS-DR_GRPO_dra-qwen2 model is a 3.1-billion-parameter instruction-tuned language model based on the Qwen2.5-3B-Instruct architecture. It has been fine-tuned with the GRPO method on the knoveleng/open-rs dataset, specializing it for mathematical reasoning. The model is designed for tasks requiring robust problem-solving, particularly in mathematical contexts, and supports a 32,768-token context length.
Model Overview
xiwenc1/OpenRS-DR_GRPO_dra-qwen2 is a 3.1-billion-parameter language model fine-tuned from the Qwen2.5-3B-Instruct base model. It leverages the GRPO (Group Relative Policy Optimization) method, introduced in the paper "DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models", to enhance its reasoning capabilities.
Key Capabilities
- Mathematical Reasoning: Specialized through GRPO training, making it suitable for tasks requiring logical and mathematical problem-solving.
- Instruction Following: Inherits strong instruction-following abilities from its Qwen2.5-3B-Instruct base.
- Extended Context: Supports a 32,768-token context window, allowing it to process longer inputs and more complex problem descriptions.
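The capabilities above can be exercised through the standard Hugging Face transformers chat workflow. The sketch below is illustrative, not published with the model: the model ID comes from this card, but the system prompt, generation settings, and the `build_messages`/`solve` helper names are assumptions.

```python
# Illustrative usage sketch for a Qwen2.5-Instruct-family model via transformers.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "xiwenc1/OpenRS-DR_GRPO_dra-qwen2"  # model ID from this card


def build_messages(problem: str) -> list:
    """Wrap a math problem in the chat format Qwen2.5-Instruct models expect."""
    return [
        # Assumed system prompt; not a setting published with the model.
        {"role": "system", "content": "You are a helpful assistant. Reason step by step."},
        {"role": "user", "content": problem},
    ]


def solve(problem: str, max_new_tokens: int = 512) -> str:
    """Generate a reasoning-style answer to a single problem."""
    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID, torch_dtype="auto", device_map="auto"
    )
    # Render the chat messages with the model's built-in chat template.
    text = tokenizer.apply_chat_template(
        build_messages(problem), tokenize=False, add_generation_prompt=True
    )
    inputs = tokenizer(text, return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=max_new_tokens)
    # Decode only the newly generated tokens, not the echoed prompt.
    return tokenizer.decode(
        out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
    )


if __name__ == "__main__":
    print(solve("What is the sum of the first 100 positive integers?"))
```

Because the model supports a long context, full problem statements (including multi-part derivations) can be passed in a single user turn.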
Training Details
The model was fine-tuned on the knoveleng/open-rs dataset using the TRL (Transformer Reinforcement Learning) framework. The application of the GRPO method specifically targets improvements in mathematical reasoning, differentiating it from general-purpose instruction-tuned models.
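The core idea behind GRPO is that each completion is scored relative to the other completions sampled for the same prompt, so no separate value network is needed. A minimal conceptual sketch of this group-relative normalization (not the actual TRL training code; the function name and reward values are illustrative):

```python
# Conceptual sketch of GRPO's group-relative advantage computation.
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of completions for a single prompt.

    Each completion's advantage is its reward minus the group mean,
    divided by the group standard deviation (eps avoids division by zero
    when all rewards in the group are identical).
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Example: four sampled answers to one math problem,
# rewarded 1.0 if the final answer is correct, 0.0 otherwise.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
print(advantages)  # correct answers get positive advantage, wrong ones negative
```

In training, these per-completion advantages weight the policy-gradient update, pushing the model toward completions that outperform their group peers; in practice this is handled by TRL rather than hand-written code.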
Good For
- Applications requiring advanced mathematical problem-solving.
- Tasks where robust reasoning and logical deduction are critical.
- Developers looking for a compact yet capable model for specialized reasoning tasks.