CohenQu/DeepSeek-R1-Distill-Qwen-7B-GRPO
Text generation · 7.6B parameters · FP8 quantization · 32k context length · Transformer architecture

CohenQu/DeepSeek-R1-Distill-Qwen-7B-GRPO is a 7.6 billion parameter language model fine-tuned from agentica-org/DeepScaleR-1.5B-Preview. It was trained with GRPO (Group Relative Policy Optimization), the reinforcement learning method introduced in the DeepSeekMath paper. Training used the TRL framework on the hf-cmu-collab/DeepScaleR-1.5B-Preview_on-policy_GRPO dataset, which suggests the model is optimized for mathematical reasoning and related problem-solving tasks.
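The core idea of GRPO, as described in the DeepSeekMath paper, is to drop the learned value network of PPO: for each prompt, a group of completions is sampled and scored, and each completion's advantage is its reward normalized against the group's mean and standard deviation. A minimal sketch of that group-relative normalization (an illustration of the general technique, not code from this model's training run):

```python
def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: normalize each reward against the
    mean and std of its sampled group (no value network needed)."""
    n = len(rewards)
    mean = sum(rewards) / n
    var = sum((r - mean) ** 2 for r in rewards) / n
    std = var ** 0.5
    # eps guards against a zero std when all rewards in the group are equal
    return [(r - mean) / (std + eps) for r in rewards]

# Example: four sampled completions for one prompt, binary correctness rewards
advs = group_relative_advantages([1.0, 0.0, 1.0, 0.0])
# Correct completions get positive advantage, incorrect ones negative,
# and the advantages sum to zero within the group.
```

These advantages then weight the policy-gradient update over the sampled tokens, typically combined with a KL penalty toward the reference model.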
