pawin205/Qwen-7B-REMOR-GRPO-no-SFT
pawin205/Qwen-7B-REMOR-GRPO-no-SFT is a 7.6 billion parameter language model fine-tuned from DeepSeek-R1-Distill-Qwen-7B. This model was trained using the GRPO method, which is designed to enhance mathematical reasoning capabilities. It is optimized for tasks requiring advanced mathematical problem-solving and logical deduction.
Loading preview...
Model Overview
pawin205/Qwen-7B-REMOR-GRPO-no-SFT is a 7.6 billion parameter language model derived from the deepseek-ai/DeepSeek-R1-Distill-Qwen-7B base model. It has been specifically fine-tuned using the TRL framework.
Key Differentiator: GRPO Training
The primary distinction of this model lies in its training methodology. It leverages GRPO (Generative Reinforcement learning with Policy Optimization), a technique introduced in the research paper "DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models" (arXiv:2402.03300). This method is designed to significantly improve the model's proficiency in mathematical reasoning tasks.
Training Details
- Base Model:
deepseek-ai/DeepSeek-R1-Distill-Qwen-7B - Training Framework: TRL (Transformer Reinforcement Learning)
- Methodology: GRPO, focused on enhancing mathematical reasoning.
Use Cases
This model is particularly well-suited for applications requiring strong mathematical problem-solving and logical reasoning. Developers can utilize it for tasks where accurate numerical and logical deductions are critical, benefiting from its specialized GRPO training.