zhaohq/PureRL-1.5B-v6b1-bare-fmt01

Text Generation | Concurrency Cost: 1 | Model Size: 1.5B | Quant: BF16 | Ctx Length: 32k | Published: May 17, 2026 | Architecture: Transformer | Warm

zhaohq/PureRL-1.5B-v6b1-bare-fmt01 is a 1.5-billion-parameter language model fine-tuned from Qwen/Qwen2.5-Math-1.5B. It was trained with the TRL framework using GRPO, a reinforcement learning method introduced to improve mathematical reasoning. The model targets mathematical problem-solving and step-by-step logical deduction.


Model Overview

zhaohq/PureRL-1.5B-v6b1-bare-fmt01 is a 1.5-billion-parameter language model derived from the Qwen/Qwen2.5-Math-1.5B base model. It was fine-tuned with the TRL framework to improve performance on mathematical reasoning tasks.

Key Differentiator

The primary distinction of this model lies in its training methodology. It uses GRPO (Group Relative Policy Optimization), a reinforcement learning method introduced in the paper "DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models". GRPO samples a group of responses for each prompt and scores every response relative to the rest of its group, which removes the need for a separate value model. The approach is intended to improve the model's handling of complex mathematical problems and logical deductions.
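The group-relative scoring at the heart of GRPO can be illustrated with a short sketch. It is not the training code used for this model; it assumes outcome-level rewards (one scalar per sampled response) and a population standard deviation, and both the `group_relative_advantages` name and the reward values are hypothetical.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the other samples drawn for the same prompt."""
    mu = mean(rewards)
    # Assumption: population standard deviation; implementations differ on this detail.
    sigma = pstdev(rewards)
    # eps keeps the division defined when every sample receives the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four sampled answers to one prompt (1.0 = correct).
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
```

Responses that beat their group's average receive a positive advantage and are reinforced, while below-average responses are pushed down. When every response in a group earns the same reward, all advantages are zero and that prompt contributes no learning signal.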

Training Details

  • Base Model: Qwen/Qwen2.5-Math-1.5B
  • Fine-tuning Framework: TRL (Transformer Reinforcement Learning)
  • Optimization Method: GRPO, as detailed in the DeepSeekMath paper.
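The card does not publish the reward functions, dataset, or hyperparameters, so the following is only a sketch of how a GRPO run is typically wired up in TRL. The `boxed_answer_reward` function, the `answer` dataset column, the `build_trainer` helper, and the `output_dir` value are all hypothetical; only `GRPOConfig` and `GRPOTrainer` come from TRL itself.

```python
import re

# Matches the contents of \boxed{...} without nested braces (a simplification).
BOXED = re.compile(r"\\boxed\{([^{}]*)\}")


def boxed_answer_reward(completions, answer, **kwargs):
    """Return 1.0 when the last \\boxed{...} value equals the reference, else 0.0.

    Hypothetical reward: assumes plain-string completions and an `answer`
    column in the training dataset, which TRL forwards as a keyword argument.
    """
    rewards = []
    for completion, reference in zip(completions, answer):
        found = BOXED.findall(completion)
        predicted = found[-1].strip() if found else None
        rewards.append(1.0 if predicted == str(reference).strip() else 0.0)
    return rewards


def build_trainer(train_dataset, output_dir="grpo-sketch"):
    """Assemble a GRPOTrainer; the dataset needs `prompt` and `answer` columns."""
    # Imported lazily so the reward function can be exercised without TRL installed.
    from trl import GRPOConfig, GRPOTrainer

    config = GRPOConfig(output_dir=output_dir)  # defaults only; real settings are unknown
    return GRPOTrainer(
        model="Qwen/Qwen2.5-Math-1.5B",
        reward_funcs=boxed_answer_reward,
        args=config,
        train_dataset=train_dataset,
    )
```

Calling `build_trainer(dataset).train()` would start a run under these assumptions. An exact-match reward like this one is deliberately simple: it is cheap to verify, but it treats equivalent forms such as `1/2` and `0.5` as different answers.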

Use Cases

This model is particularly well-suited for applications requiring strong mathematical reasoning and problem-solving. Developers can leverage it for tasks such as:

  • Solving mathematical equations and word problems.
  • Generating logical explanations for mathematical concepts.
  • Assisting in educational tools focused on mathematics.
  • Any application where robust mathematical understanding is critical.
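For the use cases above, a minimal inference sketch with the Hugging Face `transformers` library might look like the following. The card does not document a prompt template, so the plain `Question:` / `Answer:` format in `build_prompt` is an assumption, as are the helper names and the `max_new_tokens` default.

```python
MODEL_ID = "zhaohq/PureRL-1.5B-v6b1-bare-fmt01"


def build_prompt(question):
    # Assumed plain-text prompt; the card does not specify a template.
    return f"Question: {question.strip()}\nAnswer:"


def solve(question, max_new_tokens=512):
    """Generate a greedy completion for one math question."""
    # Imported lazily so build_prompt can be used without transformers installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID)

    inputs = tokenizer(build_prompt(question), return_tensors="pt")
    output_ids = model.generate(
        **inputs, max_new_tokens=max_new_tokens, do_sample=False
    )
    # Drop the prompt tokens so only the newly generated text is returned.
    new_tokens = output_ids[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)
```

Calling `solve("What is the sum of the first 10 positive integers?")` downloads the weights on first use and returns the model's generated text. Greedy decoding is chosen here for reproducibility; sampling settings that suit this model better may exist but are not stated in the card.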