zhaohq/PureRL-7B-v7-stage1-reasoning-qa

Text Generation | Concurrency Cost: 1 | Model Size: 7.6B | Quant: FP8 | Ctx Length: 32k | Published: May 20, 2026 | Architecture: Transformer | Warm

zhaohq/PureRL-7B-v7-stage1-reasoning-qa is a 7.6 billion parameter language model fine-tuned from Qwen/Qwen2.5-7B. Developed by zhaohq, this model specializes in reasoning and question-answering tasks, leveraging the GRPO training method. It is designed to enhance mathematical and general reasoning capabilities, making it suitable for complex analytical queries.


Model Overview

zhaohq/PureRL-7B-v7-stage1-reasoning-qa is a 7.6 billion parameter language model fine-tuned from Qwen/Qwen2.5-7B. It was trained with the TRL (Transformer Reinforcement Learning) library using GRPO (Group Relative Policy Optimization), a reinforcement learning method that scores each sampled completion relative to the other completions sampled for the same prompt.

Key Capabilities

  • Enhanced Reasoning: The model is optimized for reasoning tasks, including mathematical reasoning, following the training approach introduced in the DeepSeekMath paper.
  • Question Answering: It is intended for complex question-answering scenarios where an analytical, step-by-step response is needed.
  • GRPO Training: It was trained with GRPO, the reinforcement learning method introduced in the DeepSeekMath paper, to improve its reasoning abilities.
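The core idea behind GRPO can be illustrated with a small sketch of its group-relative advantage: several completions are sampled for one prompt, each receives a reward, and each reward is normalized against the group's mean and standard deviation. This is not the training code used for this model. The function name, the use of the population standard deviation, and the `eps` term are assumptions made for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of completions for the same prompt.

    Assumptions (not taken from the model card): population standard
    deviation, and a small eps to avoid division by zero when every
    completion in the group receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four completions of a single prompt
# (for example, 1.0 when the final answer is correct, 0.0 otherwise).
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Completions that score above the group average get a positive advantage and those below get a negative one, so no separate value model is needed to provide a baseline.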

Training Details

The training process was tracked and can be visualized via Weights & Biases. The model was developed using specific versions of key frameworks:

  • TRL: 0.16.0.dev0
  • Transformers: 4.57.6
  • PyTorch: 2.10.0
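To compare a local environment against the versions listed above, a check along the following lines may help. It is a minimal sketch: the helper name `version_mismatches` is hypothetical, and the PyPI distribution names (`trl`, `transformers`, `torch`) are assumed to match how the packages were installed.

```python
from importlib import metadata

# Versions as listed in the model card, keyed by assumed PyPI distribution name.
CARD_VERSIONS = {
    "trl": "0.16.0.dev0",
    "transformers": "4.57.6",
    "torch": "2.10.0",
}


def version_mismatches(expected, lookup=metadata.version):
    """Return {name: (wanted, found)} for packages that differ from `expected`.

    `found` is None when the package is not installed. A local build suffix
    such as "+cu128" is ignored when comparing.
    """
    mismatches = {}
    for name, wanted in expected.items():
        try:
            found = lookup(name).split("+")[0]
        except metadata.PackageNotFoundError:
            found = None
        if found != wanted:
            mismatches[name] = (wanted, found)
    return mismatches
```

Calling `version_mismatches(CARD_VERSIONS)` returns an empty dictionary when the installed versions match the card. A mismatch does not necessarily prevent inference; the listed versions describe the training environment.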

This model targets use cases that demand multi-step reasoning and accurate question answering, where GRPO-based reinforcement learning is intended to build on the strengths of the Qwen2.5-7B base model.
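For local experimentation, the model can presumably be loaded like any other causal language model through the Transformers library. The sketch below makes several assumptions: the card does not document a prompt format, so `build_prompt` uses a hypothetical plain "Question/Answer" layout; the helper names and the `max_new_tokens` default are illustrative; and `device_map="auto"` requires the `accelerate` package.

```python
MODEL_ID = "zhaohq/PureRL-7B-v7-stage1-reasoning-qa"


def build_prompt(question):
    # Hypothetical prompt layout; the model card does not specify one.
    return f"Question: {question.strip()}\nAnswer:"


def generate_answer(question, max_new_tokens=256):
    # Imported lazily so build_prompt works without the heavy dependencies.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    # Loading on every call keeps the sketch short; real code would load once.
    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID, torch_dtype="auto", device_map="auto"
    )

    inputs = tokenizer(build_prompt(question), return_tensors="pt").to(model.device)
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)

    # Keep only the newly generated tokens, dropping the echoed prompt.
    new_tokens = output_ids[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)
```

A call such as `generate_answer("What is 17 * 24?")` downloads the weights on first use, so it needs enough disk space and memory for a 7.6B-parameter model.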