zhaohq/PureRL-1.5B-v6i-B-step01-final03

Hugging Face
TEXT GENERATIONConcurrency Cost:1Model Size:1.5BQuant:BF16Ctx Length:32kPublished:May 19, 2026Architecture:Transformer Warm

The zhaohq/PureRL-1.5B-v6i-B-step01-final03 model is a 1.5 billion parameter language model, fine-tuned from Qwen/Qwen2.5-Math-1.5B. It was trained using the GRPO method, as introduced in the DeepSeekMath paper, and leverages TRL for its training procedure. This model is specifically optimized for mathematical reasoning tasks, building upon its Qwen2.5-Math base with a 32768 token context length.

Loading preview...

Model Overview

zhaohq/PureRL-1.5B-v6i-B-step01-final03 is a 1.5 billion parameter language model, fine-tuned from the Qwen/Qwen2.5-Math-1.5B base model. It utilizes a substantial context length of 32768 tokens, making it suitable for processing longer inputs.

Key Capabilities

  • Mathematical Reasoning: This model is specifically enhanced for mathematical reasoning tasks, building on the capabilities of its Qwen2.5-Math foundation.
  • GRPO Training: It was trained using the GRPO (Gradient Regularized Policy Optimization) method, a technique detailed in the research paper "DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models" (arXiv:2402.03300). This training approach aims to improve performance in complex reasoning scenarios.
  • TRL Framework: The fine-tuning process was conducted using the TRL (Transformer Reinforcement Learning) library, a framework for applying reinforcement learning to transformer models.

Use Cases

This model is particularly well-suited for applications requiring strong mathematical problem-solving and reasoning abilities. Its training methodology and base model suggest its utility in tasks that benefit from advanced numerical and logical processing.