zhaohq/PureRL-7B-v7-stage1-conf-tag-instruct

Hugging Face
TEXT GENERATIONConcurrency Cost:1Model Size:7.6BQuant:FP8Ctx Length:32kPublished:May 22, 2026Architecture:Transformer0.0K Warm

zhaohq/PureRL-7B-v7-stage1-conf-tag-instruct is a 7.6 billion parameter instruction-tuned language model, fine-tuned from Qwen/Qwen2.5-7B-Instruct. It was trained using the TRL framework and the GRPO method, which is designed to enhance mathematical reasoning in large language models. This model is optimized for tasks requiring robust reasoning capabilities, particularly in mathematical contexts, leveraging techniques from DeepSeekMath.

Loading preview...

Model Overview

zhaohq/PureRL-7B-v7-stage1-conf-tag-instruct is a 7.6 billion parameter instruction-tuned model built upon the Qwen/Qwen2.5-7B-Instruct architecture. This model distinguishes itself through its specialized training methodology, utilizing the TRL (Transformer Reinforcement Learning) framework.

Key Capabilities & Training

  • Enhanced Reasoning: The model was trained with GRPO (Gradient-based Reward Policy Optimization), a method introduced in the DeepSeekMath paper. This technique is specifically designed to push the limits of mathematical reasoning in open language models.
  • Fine-tuned Performance: By fine-tuning the robust Qwen2.5-7B-Instruct base with GRPO, this model aims to improve performance on tasks that benefit from advanced reasoning and problem-solving.
  • Framework Versions: The training utilized TRL 0.16.0.dev0, Transformers 4.57.6, Pytorch 2.10.0, Datasets 4.8.5, and Tokenizers 0.22.2.

Use Cases

This model is particularly well-suited for applications requiring:

  • Mathematical Problem Solving: Its GRPO training suggests strong performance in tasks involving mathematical reasoning and complex calculations.
  • Instruction Following: As an instruction-tuned model, it is designed to accurately follow user prompts and generate relevant responses.
  • Research and Development: Developers and researchers can leverage this model for exploring advanced reasoning capabilities in LLMs, especially in areas related to mathematics and logic.