pawin205/Qwen-7B-REMOR-GRPO-no-SFT

TEXT GENERATIONConcurrency Cost:1Model Size:7.6BQuant:FP8Ctx Length:32kPublished:Apr 21, 2026Architecture:Transformer Cold

pawin205/Qwen-7B-REMOR-GRPO-no-SFT is a 7.6 billion parameter language model fine-tuned from DeepSeek-R1-Distill-Qwen-7B. This model was trained using the GRPO method, which is designed to enhance mathematical reasoning capabilities. It is optimized for tasks requiring advanced mathematical problem-solving and logical deduction.

Loading preview...

Model Overview

pawin205/Qwen-7B-REMOR-GRPO-no-SFT is a 7.6 billion parameter language model derived from the deepseek-ai/DeepSeek-R1-Distill-Qwen-7B base model. It has been specifically fine-tuned using the TRL framework.

Key Differentiator: GRPO Training

The primary distinction of this model lies in its training methodology. It leverages GRPO (Generative Reinforcement learning with Policy Optimization), a technique introduced in the research paper "DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models" (arXiv:2402.03300). This method is designed to significantly improve the model's proficiency in mathematical reasoning tasks.

Training Details

  • Base Model: deepseek-ai/DeepSeek-R1-Distill-Qwen-7B
  • Training Framework: TRL (Transformer Reinforcement Learning)
  • Methodology: GRPO, focused on enhancing mathematical reasoning.

Use Cases

This model is particularly well-suited for applications requiring strong mathematical problem-solving and logical reasoning. Developers can utilize it for tasks where accurate numerical and logical deductions are critical, benefiting from its specialized GRPO training.