zhaohq/PureRL-7B-v7-s2-corr-maskon

Text Generation · Concurrency Cost: 1 · Model Size: 7.6B · Quant: FP8 · Ctx Length: 32k · Published: May 20, 2026 · Architecture: Transformer · Warm

The zhaohq/PureRL-7B-v7-s2-corr-maskon model is a 7.6-billion-parameter language model fine-tuned with the TRL library. It was trained with GRPO, the reinforcement learning method introduced in the DeepSeekMath paper to improve mathematical reasoning. The model is intended for general text generation, and its GRPO training may improve responses to reasoning-heavy prompts.


Model Overview

zhaohq/PureRL-7B-v7-s2-corr-maskon is a 7.6-billion-parameter language model fine-tuned with the TRL library. It was trained with GRPO (Group Relative Policy Optimization), a method introduced in the paper "DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models".
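The central idea of GRPO is that several completions are sampled for the same prompt and each one is scored relative to the others in its group, rather than against a learned value function. The sketch below illustrates only that group-relative normalization step. It is not this model's training code: the function name, the `eps` value, and the use of the population standard deviation are assumptions made for illustration, and the reward values are made up.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-4):
    """Normalize the rewards of completions sampled for one prompt.

    Each advantage is (reward - group mean) / (group std + eps).
    `eps` is an illustrative constant that avoids division by zero
    when every completion in the group receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption for this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four completions of the same prompt
# (for example, 1.0 for a correct final answer and 0.0 otherwise).
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
```

Completions that score above the group mean get a positive advantage and are reinforced; those below the mean get a negative one. When all rewards in a group are identical, every advantage is zero and the group contributes no learning signal.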

Key Capabilities

  • Fine-tuned Performance: Post-trained with the TRL library, which may improve instruction following and response generation.
  • GRPO Training: Benefits from a training procedure designed to improve reasoning capabilities, as outlined in the DeepSeekMath research.
  • General Text Generation: Capable of generating coherent and contextually relevant text for a variety of prompts.
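For readers who want to try text generation locally, a minimal sketch with the Transformers `pipeline` API might look like the following. It assumes the weights are published on the Hugging Face Hub under this repository name and load with the standard text-generation pipeline; the helper name `generate` and the `max_new_tokens` default are illustrative, not values taken from the model card.

```python
MODEL_ID = "zhaohq/PureRL-7B-v7-s2-corr-maskon"


def generate(prompt, max_new_tokens=256):
    """Generate a completion for `prompt` (hypothetical helper).

    Assumes the repository is loadable with the standard
    text-generation pipeline; downloads the weights on first use.
    """
    # Imported lazily so this module can be read or imported
    # without transformers installed.
    from transformers import pipeline

    generator = pipeline("text-generation", model=MODEL_ID)
    outputs = generator(
        prompt,
        max_new_tokens=max_new_tokens,
        return_full_text=False,  # return only the newly generated text
    )
    return outputs[0]["generated_text"]
```

Calling `generate("What is the sum of the first 10 positive integers?")` downloads the model and returns the generated continuation as a string. A 7.6B model generally needs a GPU with sufficient memory to run at a usable speed.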

Training Details

The model was trained with GRPO, a method that has been shown to improve mathematical reasoning in large language models. The training environment used the following framework versions:

  • TRL: 0.16.0.dev0
  • Transformers: 4.57.6
  • PyTorch: 2.10.0
  • Datasets: 4.8.5
  • Tokenizers: 0.22.2
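To compare a local environment against the versions listed above, a small standard-library check can help. This is a sketch: the distribution names (`trl`, `transformers`, `torch`, `datasets`, `tokenizers`) are assumed to be the usual PyPI names, and the function name is hypothetical.

```python
from importlib.metadata import PackageNotFoundError, version

# Versions listed in the model card; distribution names are assumed.
PINNED = {
    "trl": "0.16.0.dev0",
    "transformers": "4.57.6",
    "torch": "2.10.0",
    "datasets": "4.8.5",
    "tokenizers": "0.22.2",
}


def compare_versions(pinned):
    """Return {package: (expected, installed_or_None, matches)}."""
    report = {}
    for name, expected in pinned.items():
        try:
            # Drop a local build suffix such as "+cu128" before comparing.
            installed = version(name).split("+")[0]
        except PackageNotFoundError:
            installed = None  # package is not installed
        report[name] = (expected, installed, installed == expected)
    return report


if __name__ == "__main__":
    for name, (expected, installed, ok) in compare_versions(PINNED).items():
        status = "ok" if ok else "differs"
        print(f"{name}: expected {expected}, installed {installed} ({status})")
```

A mismatch does not necessarily prevent the model from loading; the listed versions describe the training environment, not minimum requirements for inference.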

Good For

  • Developers looking for a 7.6B parameter model fine-tuned with advanced reinforcement learning techniques.
  • Applications requiring general text generation with potentially improved reasoning characteristics due to its GRPO training.
  • Experimentation with models that incorporate methods from mathematical reasoning research.