zhaohq/PureRL-1.5B-v6g-B-lam03-sigmoid-maskoff

Hugging Face
TEXT GENERATIONConcurrency Cost:1Model Size:1.5BQuant:BF16Ctx Length:32kPublished:May 19, 2026Architecture:Transformer Warm

The zhaohq/PureRL-1.5B-v6g-B-lam03-sigmoid-maskoff model is a 1.5 billion parameter language model, fine-tuned from Qwen/Qwen2.5-Math-1.5B. It was trained using the TRL framework and incorporates the GRPO method, which is designed to enhance mathematical reasoning capabilities. This model is optimized for tasks requiring advanced reasoning, particularly in mathematical contexts, leveraging techniques from the DeepSeekMath research. With a context length of 32768 tokens, it is suitable for processing and generating responses for complex queries.

Loading preview...

Model Overview

The zhaohq/PureRL-1.5B-v6g-B-lam03-sigmoid-maskoff is a 1.5 billion parameter language model, building upon the base architecture of Qwen/Qwen2.5-Math-1.5B. It has been specifically fine-tuned using the TRL (Transformer Reinforcement Learning) framework.

Key Training Methodology

A core differentiator for this model is its training procedure, which utilizes GRPO (Generalized Reinforcement Learning with Policy Optimization). This method was introduced in the research paper "DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models" (arXiv:2402.03300). The application of GRPO suggests an optimization for tasks that demand robust mathematical and reasoning abilities.

Intended Use Cases

Given its foundation in a math-focused base model and fine-tuning with GRPO, this model is particularly well-suited for:

  • Mathematical Reasoning: Solving complex mathematical problems and generating logical explanations.
  • Problem Solving: Handling queries that require structured reasoning and analytical thinking.
  • Research and Development: As a base for further experimentation in reinforcement learning for language models, especially in mathematical domains.