zhaohq/PureRL-7B-v7-s2-l2-maskon

Hugging Face
TEXT GENERATIONConcurrency Cost:1Model Size:7.6BQuant:FP8Ctx Length:32kPublished:May 21, 2026Architecture:Transformer Warm

The zhaohq/PureRL-7B-v7-s2-l2-maskon model is a 7.6 billion parameter language model fine-tuned by zhaohq using the TRL framework. It leverages the GRPO method, as introduced in the DeepSeekMath paper, to enhance its capabilities. This model is specifically optimized for tasks requiring advanced reasoning, particularly in mathematical contexts, making it suitable for applications demanding precise logical inference.

Loading preview...

Model Overview

The zhaohq/PureRL-7B-v7-s2-l2-maskon is a 7.6 billion parameter language model developed by zhaohq. It is a fine-tuned variant, built upon an unspecified base model, and trained using the Transformer Reinforcement Learning (TRL) framework.

Key Differentiator: GRPO Training

A core aspect of this model's development is its training with GRPO (Generalized Reinforcement Learning with Policy Optimization). This method, detailed in the paper "DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models," is designed to significantly improve a model's reasoning abilities, especially in complex mathematical domains. This suggests the model is optimized for tasks requiring logical deduction and problem-solving.

Training Environment

The model was trained using specific versions of popular frameworks:

  • TRL: 0.16.0.dev0
  • Transformers: 4.57.6
  • PyTorch: 2.10.0
  • Datasets: 4.8.5
  • Tokenizers: 0.22.2

Potential Use Cases

Given its GRPO-enhanced training, this model is likely well-suited for:

  • Mathematical problem-solving
  • Logical reasoning tasks
  • Applications requiring precise and structured outputs

Developers can quickly get started with the provided transformers pipeline example for text generation.