zhaohq/PureRL-1.5B-v7-stage1-A-fewshot

Text Generation | Concurrency Cost: 1 | Model Size: 1.5B | Quant: BF16 | Ctx Length: 32k | Published: May 20, 2026 | Architecture: Transformer | Warm

The zhaohq/PureRL-1.5B-v7-stage1-A-fewshot model is a 1.5-billion-parameter language model fine-tuned from Qwen/Qwen2.5-Math-1.5B. Developed by zhaohq, it was trained with the TRL framework using GRPO, a reinforcement learning method introduced in the DeepSeekMath paper to improve mathematical reasoning. The model targets mathematical problem solving.


Model Overview

This model, zhaohq/PureRL-1.5B-v7-stage1-A-fewshot, is a 1.5-billion-parameter language model derived from the Qwen/Qwen2.5-Math-1.5B base. It was fine-tuned using the TRL (Transformer Reinforcement Learning) framework.

Key Training Details

The model was trained with GRPO (Group Relative Policy Optimization). This method was introduced in the paper "DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models," where it was used to improve mathematical reasoning. GRPO samples a group of completions for each prompt and scores each completion relative to the rest of its group, so it does not need a separate value model.
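The group-relative scoring at the heart of GRPO can be illustrated with a minimal sketch. This is not the training code used for this model: the function name, the `eps` stabilizer, and the choice of population standard deviation are assumptions made for illustration only.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-4):
    """Normalize rewards within one group of completions for a single prompt.

    Each advantage is (reward - group mean) / (group std + eps).
    `eps` is a hypothetical stabilizer that avoids division by zero when
    every completion in the group receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an implementation choice
    return [(r - mu) / (sigma + eps) for r in rewards]


# Example: four sampled answers to one math problem, rewarded 1.0 if correct.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
print(advantages)
```

Correct answers end up with positive advantages and incorrect ones with negative advantages. When all completions in a group receive the same reward, every advantage is zero and the group contributes no learning signal.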

Intended Use

Given its math-focused base model and GRPO training, this model is best suited to tasks that require mathematical reasoning and problem solving. Developers can load it with the Hugging Face Transformers pipeline API for text generation.
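A minimal sketch of loading the model through the Transformers text-generation pipeline follows. The helper names, the prompt wording, and the `max_new_tokens` default are hypothetical; the prompt format the model was actually trained with is not documented here, so treat the plain "Question / Answer" layout as a placeholder.

```python
MODEL_ID = "zhaohq/PureRL-1.5B-v7-stage1-A-fewshot"


def build_prompt(question):
    """Wrap a question in a plain-text prompt (hypothetical format)."""
    return f"Question: {question}\nAnswer:"


def generate_answer(question, max_new_tokens=256):
    """Generate a continuation for the question with the text-generation pipeline.

    The import is kept inside the function so the helper above can be used
    without Transformers installed. The first call downloads the weights.
    """
    from transformers import pipeline

    generator = pipeline("text-generation", model=MODEL_ID)
    outputs = generator(
        build_prompt(question),
        max_new_tokens=max_new_tokens,
        return_full_text=False,  # return only the newly generated text
    )
    return outputs[0]["generated_text"]
```

Calling `generate_answer("What is the sum of the first 10 positive integers?")` returns the generated text as a string. On a machine with a GPU, passing a `device` argument to `pipeline` would likely be worthwhile.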

Framework Versions

Key frameworks used during its development include:

  • TRL: 0.16.0.dev0
  • Transformers: 4.57.6
  • PyTorch: 2.10.0
  • Datasets: 4.8.5
  • Tokenizers: 0.22.2
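
To compare a local environment against the versions listed above, a small check along these lines may help. The function name and report layout are hypothetical. The PyPI distribution names (`trl`, `transformers`, `torch`, `datasets`, `tokenizers`) are used for lookup, and any local build suffix such as `+cu126` is ignored when comparing.

```python
from importlib.metadata import PackageNotFoundError, version

# Versions listed on this model card, keyed by PyPI distribution name.
EXPECTED_VERSIONS = {
    "trl": "0.16.0.dev0",
    "transformers": "4.57.6",
    "torch": "2.10.0",
    "datasets": "4.8.5",
    "tokenizers": "0.22.2",
}


def compare_versions(expected=EXPECTED_VERSIONS):
    """Return {name: (expected, installed, matches)} for each package.

    `installed` is None when the package is not installed.
    """
    report = {}
    for name, want in expected.items():
        try:
            have = version(name)
        except PackageNotFoundError:
            have = None
        # Drop a local build suffix (e.g. "+cu126") before comparing.
        matches = have is not None and have.split("+")[0] == want
        report[name] = (want, have, matches)
    return report


for name, (want, have, ok) in compare_versions().items():
    print(f"{name}: expected {want}, installed {have}, match={ok}")
```

A mismatch does not necessarily prevent inference; the listed versions describe the training environment.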