CohenQu/DeepSeek-R1-Distill-Qwen-7B-GRPO

Text Generation · Concurrency Cost: 1 · Model Size: 7.6B · Quant: FP8 · Ctx Length: 32k · Architecture: Transformer

CohenQu/DeepSeek-R1-Distill-Qwen-7B-GRPO is a 7.6 billion parameter language model fine-tuned from agentica-org/DeepScaleR-1.5B-Preview. It was trained with GRPO (Group Relative Policy Optimization), a reinforcement learning method introduced in the DeepSeekMath paper, using the TRL framework on the hf-cmu-collab/DeepScaleR-1.5B-Preview_on-policy_GRPO dataset, which suggests an optimization for mathematical reasoning and complex problem-solving.


Model Overview

CohenQu/DeepSeek-R1-Distill-Qwen-7B-GRPO is a 7.6 billion parameter language model derived from agentica-org/DeepScaleR-1.5B-Preview. It has been fine-tuned using the TRL (Transformer Reinforcement Learning) framework.
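
Since the model follows the standard Transformers text-generation interface, it can presumably be loaded like any other causal LM. The snippet below is a minimal loading sketch, assuming the repository ships a standard `config.json`, tokenizer, and weights; it is not taken from the model card itself, and the `bfloat16` dtype is an assumption.

```python
# Minimal loading sketch (assumes a standard Transformers checkpoint layout).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "CohenQu/DeepSeek-R1-Distill-Qwen-7B-GRPO"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # assumption: bf16 inference on a single GPU
    device_map="auto",
)
```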

Key Training Methodology

This model's primary differentiator is its training with GRPO (Group Relative Policy Optimization). This method was introduced in the paper "DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models," and is designed to improve complex reasoning and problem-solving abilities, particularly in mathematical contexts. The training used the hf-cmu-collab/DeepScaleR-1.5B-Preview_on-policy_GRPO dataset.
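
TRL provides a `GRPOTrainer` for this method. The sketch below shows roughly how such a run could be set up; the reward function, hyperparameters, and base checkpoint argument are illustrative assumptions, not the recipe actually used for this model.

```python
# Illustrative GRPO fine-tuning sketch with TRL (settings are assumptions).
from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer

def reward_len(completions, **kwargs):
    # Placeholder reward favoring ~200-character completions; a real run
    # for this model would presumably score mathematical correctness instead.
    return [-abs(200 - len(completion)) for completion in completions]

# GRPOTrainer expects a dataset with a "prompt" column.
dataset = load_dataset(
    "hf-cmu-collab/DeepScaleR-1.5B-Preview_on-policy_GRPO", split="train"
)

training_args = GRPOConfig(output_dir="DeepSeek-R1-Distill-Qwen-7B-GRPO")
trainer = GRPOTrainer(
    model="agentica-org/DeepScaleR-1.5B-Preview",  # base model as stated above
    reward_funcs=reward_len,
    args=training_args,
    train_dataset=dataset,
)
trainer.train()
```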

Framework Versions

The model was developed using specific versions of key frameworks; a runtime version check is sketched after the list:

  • TRL: 0.15.0.dev0
  • Transformers: 4.49.0.dev0
  • PyTorch: 2.5.1
  • Datasets: 3.2.0
  • Tokenizers: 0.21.0
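
Since the `.dev0` builds are pre-release versions that are not published on PyPI, a quick way to confirm a local environment matches is to check installed versions at runtime. A minimal sketch:

```python
# Check that locally installed framework versions match the ones listed above.
from importlib.metadata import version

expected = {
    "trl": "0.15.0.dev0",
    "transformers": "4.49.0.dev0",
    "torch": "2.5.1",
    "datasets": "3.2.0",
    "tokenizers": "0.21.0",
}

for package, wanted in expected.items():
    installed = version(package)
    status = "OK" if installed == wanted else f"mismatch (found {installed})"
    print(f"{package}: {status}")
```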

Potential Use Cases

Given its GRPO training, this model is likely well-suited for applications requiring advanced reasoning, logical deduction, and mathematical problem-solving, in line with the objectives of the DeepSeekMath research.
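
As a concrete illustration, the sketch below prompts the model with a simple math word problem. It reuses the `model` and `tokenizer` from the loading sketch above and assumes the tokenizer ships a chat template (as DeepSeek-R1 distills typically do); the sampling settings are assumptions.

```python
# Example math-reasoning prompt, reusing `model` and `tokenizer` from above.
prompt = "A train travels 120 km in 1.5 hours. What is its average speed in km/h?"

messages = [{"role": "user", "content": prompt}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(
    inputs,
    max_new_tokens=1024,   # reasoning-style models emit long chains of thought
    do_sample=True,
    temperature=0.6,       # assumption: moderate temperature for reasoning
)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```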