leonMW/DeepSeek-R1-Distill-Qwen-7B-GSPO-Basic

TEXT GENERATION · Concurrency Cost: 1 · Model Size: 7.6B · Quant: FP8 · Ctx Length: 32k · Published: Aug 9, 2025 · Architecture: Transformer

leonMW/DeepSeek-R1-Distill-Qwen-7B-GSPO-Basic is a 7.6-billion-parameter language model fine-tuned from deepseek-ai/DeepSeek-R1-Distill-Qwen-7B. It was trained with the GRPO method, introduced in the DeepSeekMath paper, to enhance mathematical reasoning. The model is optimized for tasks requiring advanced reasoning and retains the base model's 32,768-token context length.


Model Overview

This model, leonMW/DeepSeek-R1-Distill-Qwen-7B-GSPO-Basic, is a fine-tuned variant of the deepseek-ai/DeepSeek-R1-Distill-Qwen-7B base model. It was trained with GRPO (Group Relative Policy Optimization), a reinforcement-learning method originally introduced in the research paper "DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models" (arXiv:2402.03300).

Key Capabilities

  • Enhanced Reasoning: The GRPO training procedure targets improved reasoning ability, particularly in mathematical problem-solving, the domain for which the method was originally developed.
  • Instruction Following: As a fine-tuned model, it is designed to follow instructions effectively, making it suitable for various conversational and task-oriented applications.
  • Large Context Window: Maintains the base model's substantial context length of 32768 tokens, allowing it to process and generate longer, more coherent responses.
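The model can be used like any other causal language model on the Hub. The sketch below shows one plausible way to run inference with the Transformers library; the prompt, generation settings, and hardware note are illustrative assumptions, not part of this model card.

```python
# Minimal inference sketch using Hugging Face Transformers.
# The prompt and max_new_tokens value are illustrative, not from the model card.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "leonMW/DeepSeek-R1-Distill-Qwen-7B-GSPO-Basic"

def generate(prompt: str, max_new_tokens: int = 512) -> str:
    """Download the model, apply its chat template, and return a completion."""
    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID, torch_dtype="auto", device_map="auto"
    )
    messages = [{"role": "user", "content": prompt}]
    inputs = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)
    outputs = model.generate(inputs, max_new_tokens=max_new_tokens)
    # Decode only the newly generated tokens, not the echoed prompt.
    return tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True)

# Example call (requires a GPU with enough memory for the 7.6B weights):
# print(generate("What is the sum of the first 100 positive integers?"))
```

The function wraps loading and generation together for brevity; in a serving setting you would load the model once and reuse it across calls.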

Training Details

The model was fine-tuned with the TRL library (version 0.23.1) on top of the DeepSeek-R1-Distill-Qwen-7B base model. The GRPO training aims to refine performance beyond the base model's capabilities, with a focus on robust and accurate outputs.