zhaohq/PureRL-1.5B-v7-s2-l2-maskon-fixed

Hugging Face
Text Generation | Concurrency Cost: 1 | Model Size: 1.5B | Quant: BF16 | Ctx Length: 32k | Published: May 20, 2026 | Architecture: Transformer | Warm

zhaohq/PureRL-1.5B-v7-s2-l2-maskon-fixed is a 1.5-billion-parameter language model fine-tuned by zhaohq using the TRL framework. It was trained with GRPO, the reinforcement learning method introduced in the DeepSeekMath paper on mathematical reasoning. The model is intended for general text generation, where this training may improve logical coherence.

Model Overview

zhaohq/PureRL-1.5B-v7-s2-l2-maskon-fixed is a 1.5-billion-parameter language model developed by zhaohq. It was fine-tuned with TRL (Transformer Reinforcement Learning), a library for post-training language models, so its behavior was shaped by reinforcement learning rather than by supervised fine-tuning alone.

Key Training Methodology

A distinguishing feature of this model is its training procedure, which uses GRPO (Group Relative Policy Optimization). This method was introduced in the research paper "DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models". GRPO samples a group of responses for each prompt and scores every response relative to the group's average reward, which removes the need for a separate value model. Its use here suggests an emphasis on complex reasoning tasks, potentially extending beyond pure mathematics.
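The group-relative scoring at the heart of GRPO can be illustrated with a minimal sketch. It is not this model's training code: the function name `group_relative_advantages`, the `eps` constant, and the example rewards are illustrative assumptions, and the choice of population standard deviation is one reasonable reading of the normalization described in the DeepSeekMath paper.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize the rewards of one prompt's sampled responses.

    Each response's advantage is its reward minus the group mean,
    divided by the group standard deviation. `eps` (an assumed small
    constant) avoids division by zero when all rewards are equal.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four responses sampled from the same prompt,
# e.g. 1.0 for a correct final answer and 0.0 otherwise.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
print(advantages)
```

Responses that beat the group average receive a positive advantage and are reinforced; those below it receive a negative one. When every response in a group earns the same reward, all advantages are zero and that group contributes no learning signal.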

Intended Use

This model is suited to text generation tasks that benefit from logical consistency or multi-step reasoning. Because GRPO originated in work on mathematical reasoning, the training likely favors structured, coherent output, which makes the model a candidate for applications that need more than fluent text.
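For trying the model on a text generation task, a short sketch using the `transformers` text-generation pipeline might look like the following. It assumes `transformers` and a backend such as PyTorch are installed and that the checkpoint is downloadable under the model ID shown at the top of this page. The helper name `generate` and the default `max_new_tokens` value are illustrative choices, not settings recommended by the model author.

```python
MODEL_ID = "zhaohq/PureRL-1.5B-v7-s2-l2-maskon-fixed"


def generate(prompt, max_new_tokens=256):
    """Generate a completion for `prompt` with the model.

    The import is deferred so that defining this helper does not
    require `transformers`; calling it downloads the weights on
    first use.
    """
    from transformers import pipeline

    generator = pipeline("text-generation", model=MODEL_ID)
    outputs = generator(
        prompt,
        max_new_tokens=max_new_tokens,  # assumed budget, tune as needed
        return_full_text=False,         # return only the newly generated text
    )
    return outputs[0]["generated_text"]
```

Calling `generate("If 3x + 5 = 20, what is x? Explain step by step.")` would return the model's completion as a string. The prompt here is only an example of a reasoning-style request.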