zhaohq/PureRL-1.5B-v7-s2-l2-kl-w3-b2

Hugging Face
Text Generation | Concurrency Cost: 1 | Model Size: 1.5B | Quant: BF16 | Ctx Length: 32k | Published: May 21, 2026 | Architecture: Transformer | Warm

The zhaohq/PureRL-1.5B-v7-s2-l2-kl-w3-b2 model is a 1.5 billion parameter language model fine-tuned with the TRL framework. It was trained with GRPO (Group Relative Policy Optimization), the reinforcement learning method introduced in the DeepSeekMath paper to improve mathematical reasoning. Given this training method, the model is aimed at mathematical and logical problem solving.


Model Overview

zhaohq/PureRL-1.5B-v7-s2-l2-kl-w3-b2 is a 1.5 billion parameter language model fine-tuned with the TRL (Transformer Reinforcement Learning) framework. Its training procedure uses GRPO (Group Relative Policy Optimization), the method introduced in the DeepSeekMath paper. GRPO samples a group of completions for each prompt, scores them with a reward function, and updates the policy using each completion's reward relative to the rest of its group. This points to a specialized focus on complex mathematical reasoning tasks.
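The core idea behind GRPO is that each sampled completion is scored relative to the other completions drawn for the same prompt, rather than against a learned value function. The following is a minimal sketch of that group-relative advantage in pure Python. The function name, the `eps` constant, and the use of the population standard deviation are illustrative assumptions, not details taken from this model's training run.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-4):
    """Normalize rewards within one group of completions for a single prompt.

    Each advantage is (reward - group mean) / (group std + eps).
    `eps` is an assumed small constant that avoids division by zero
    when every completion in the group receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption for this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical group of four completions: two correct (1.0), two wrong (0.0).
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
```

Completions that beat their group's average receive a positive advantage and the others a negative one. When every completion in a group earns the same reward, all advantages are zero and that group contributes no learning signal.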

Key Training Details

  • Fine-tuning Framework: TRL (version 0.16.0.dev0)
  • Optimization Method: GRPO, as described in the "DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models" paper.
  • Framework Versions: Transformers 4.48.3, PyTorch 2.5.1, Datasets 4.0.0, and Tokenizers 0.21.1.
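A GRPO run with TRL along the lines listed above might be set up roughly as follows. This is a sketch under stated assumptions, not the author's training script: the base model, dataset, reward function, and hyperparameters used for this model are not given here. `correctness_reward`, `extract_last_number`, `build_trainer`, and the `answer` dataset column are hypothetical names introduced for illustration; `GRPOConfig` and `GRPOTrainer` are TRL classes.

```python
import re


def extract_last_number(text):
    """Return the last integer or decimal found in `text`, or None."""
    matches = re.findall(r"-?\d+(?:\.\d+)?", text)
    return matches[-1] if matches else None


def correctness_reward(completions, answer, **kwargs):
    """Hypothetical reward: 1.0 if the last number in a completion equals
    the reference answer, else 0.0. TRL passes extra dataset columns
    (here an assumed `answer` column) to reward functions as keyword arguments."""
    rewards = []
    for completion, expected in zip(completions, answer):
        predicted = extract_last_number(completion)
        is_correct = predicted is not None and float(predicted) == float(expected)
        rewards.append(1.0 if is_correct else 0.0)
    return rewards


def build_trainer(base_model_id, train_dataset, output_dir="grpo-output"):
    """Build a GRPOTrainer. `base_model_id` must be supplied by the caller;
    `train_dataset` is assumed to have `prompt` and `answer` columns."""
    from trl import GRPOConfig, GRPOTrainer  # imported lazily; requires trl

    args = GRPOConfig(
        output_dir=output_dir,
        num_generations=8,  # completions sampled per prompt (assumed value)
    )
    return GRPOTrainer(
        model=base_model_id,
        reward_funcs=correctness_reward,
        args=args,
        train_dataset=train_dataset,
    )
```

Calling `build_trainer(...).train()` would start training. The reward function can be checked on its own without TRL installed, for example `correctness_reward(["The answer is 408."], ["408"])`.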

Potential Use Cases

  • Mathematical Reasoning: Due to its GRPO training, this model is likely well-suited for tasks involving mathematical problem-solving and logical deduction.
  • Research and Development: Useful for researchers exploring reinforcement learning techniques in language model fine-tuning, particularly those interested in GRPO's application.
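To try the model on a math prompt, a minimal loading sketch with the Transformers library could look like the following. It assumes the checkpoint loads through `AutoModelForCausalLM` and accepts a plain-text prompt; `build_prompt` and `generate_answer` are hypothetical helper names, and the prompt wording and generation settings are illustrative.

```python
MODEL_ID = "zhaohq/PureRL-1.5B-v7-s2-l2-kl-w3-b2"


def build_prompt(question):
    """Wrap a question in a simple instruction. The wording is an assumption,
    not a prompt format documented for this model."""
    return f"Solve the following problem step by step.\n\n{question}\n"


def generate_answer(question, max_new_tokens=512):
    """Download the model and generate a completion for `question`."""
    # Imported lazily so the helpers above can be used without torch installed.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID,
        torch_dtype=torch.bfloat16,  # matches the BF16 weights listed above
    )

    inputs = tokenizer(build_prompt(question), return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)

    # Drop the prompt tokens so only the newly generated text is returned.
    prompt_length = inputs["input_ids"].shape[-1]
    return tokenizer.decode(output_ids[0][prompt_length:], skip_special_tokens=True)
```

A call such as `generate_answer("What is 17 * 24?")` downloads the weights on first use and returns the generated text.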