zhaohq/PureRL-7B-v6e-B-lam03-sigmoid-maskon-acc05

Task: Text Generation | Concurrency Cost: 1 | Model Size: 7.6B | Quant: FP8 | Ctx Length: 32k | Published: May 18, 2026 | Architecture: Transformer

PureRL-7B-v6e-B-lam03-sigmoid-maskon-acc05 is a 7.6-billion-parameter language model fine-tuned from Qwen/Qwen2.5-Math-7B by zhaohq. It was trained with GRPO, a reinforcement learning method aimed at improving mathematical reasoning. The listed context length is 32768 tokens, which accommodates long problem statements and multi-step solutions.


Overview

PureRL-7B-v6e-B-lam03-sigmoid-maskon-acc05 is a 7.6-billion-parameter language model developed by zhaohq. It is a fine-tuned version of the Qwen/Qwen2.5-Math-7B base model, optimized for mathematical reasoning tasks. Training used the TRL (Transformer Reinforcement Learning) framework with the GRPO (Group Relative Policy Optimization) method.

Key Training Details

  • Base Model: Qwen/Qwen2.5-Math-7B
  • Training Method: GRPO (Group Relative Policy Optimization), introduced in the DeepSeekMath paper, "DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models".
  • Framework: TRL (Transformer Reinforcement Learning)
  • Context Length: 32768 tokens.
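
To make the training method above more concrete, here is a minimal sketch of the group-relative advantage that gives GRPO its name: several completions are sampled for the same prompt, each is scored, and every reward is normalized against the mean and standard deviation of its own group. This illustrates the general idea from the DeepSeekMath paper only. The reward values, the `eps` constant, the function name, and the use of the population standard deviation are assumptions for illustration, not details of how this model was trained.

```python
from statistics import fmean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of completions sampled for the same prompt."""
    if not rewards:
        raise ValueError("rewards must contain at least one value")
    mean = fmean(rewards)
    # Population standard deviation is an assumption here; implementations may differ.
    std = pstdev(rewards)
    # eps (hypothetical value) avoids division by zero when all rewards are equal.
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical rewards: four completions for one prompt, 1.0 if correct, 0.0 otherwise.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Completions that score above their group's average receive a positive advantage and those below receive a negative one, so no separate value model is needed to provide a baseline.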

Use Cases

This model is intended for applications that require mathematical problem-solving and step-by-step reasoning. Its GRPO fine-tuning on top of a math-specialized base model is meant to improve performance on complex mathematical queries relative to general-purpose language models.
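
As a rough illustration of such a use case, the sketch below shows one way the model might be queried with the Hugging Face `transformers` library. It assumes the weights are available under the repository ID shown in the page title and that a plain-text prompt is acceptable. The prompt wording, the helper names `build_prompt` and `generate_answer`, and the generation settings are hypothetical, since the page does not specify a prompt format.

```python
MODEL_ID = "zhaohq/PureRL-7B-v6e-B-lam03-sigmoid-maskon-acc05"


def build_prompt(question: str) -> str:
    # Hypothetical plain-text prompt; the page does not document a prompt format.
    return (
        "Solve the following problem. Reason step by step and put the final "
        "answer within \\boxed{}.\n\n"
        f"Problem: {question.strip()}\nSolution:"
    )


def generate_answer(question: str, max_new_tokens: int = 512) -> str:
    # Imported lazily so build_prompt can be used without transformers installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    # device_map="auto" requires the accelerate package.
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID, torch_dtype="auto", device_map="auto"
    )
    inputs = tokenizer(build_prompt(question), return_tensors="pt").to(model.device)
    # Greedy decoding is an assumption; sampling settings are not documented.
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
    # Keep only the newly generated tokens, dropping the echoed prompt.
    new_tokens = output_ids[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)
```

Calling `generate_answer("What is the sum of the first 10 positive integers?")` would download the weights on first use and needs enough memory for a 7.6B-parameter model.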