zhaohq/PureRL-1.5B-v12B-lam005

Text Generation
Concurrency Cost: 1
Model Size: 1.5B
Quant: BF16
Ctx Length: 32k
Published: May 19, 2026
Architecture: Transformer

zhaohq/PureRL-1.5B-v12B-lam005 is a 1.5-billion-parameter language model fine-tuned from Qwen/Qwen2.5-Math-1.5B, with a 32K context length. Developed by zhaohq, it was trained with GRPO (Group Relative Policy Optimization), the reinforcement learning method introduced in the DeepSeekMath paper, and targets mathematical problem-solving and logical deduction.


Overview

This model, zhaohq/PureRL-1.5B-v12B-lam005, is a 1.5-billion-parameter language model derived from the Qwen/Qwen2.5-Math-1.5B base. It was fine-tuned with the TRL library using GRPO (Group Relative Policy Optimization).
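As a rough illustration of how a model like this is typically loaded, the sketch below uses the standard `transformers` causal-LM classes. The card does not document a prompt format, so `build_prompt` is a hypothetical helper; its step-by-step, boxed-answer instruction is an assumption borrowed from common practice with Qwen2.5-Math models, and `generate_answer` is likewise an illustrative name rather than part of any published API.

```python
MODEL_ID = "zhaohq/PureRL-1.5B-v12B-lam005"


def build_prompt(question: str) -> str:
    """Hypothetical prompt format; the model card does not specify one."""
    instruction = "Please reason step by step, and put your final answer within \\boxed{}."
    return question.strip() + "\n" + instruction


def generate_answer(question: str, max_new_tokens: int = 512) -> str:
    # Imported lazily so build_prompt can be used without the heavy dependencies.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    # BF16 matches the precision listed on the card.
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype=torch.bfloat16)

    inputs = tokenizer(build_prompt(question), return_tensors="pt")
    # Greedy decoding is an assumption; the card gives no sampling settings.
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)

    # Keep only the newly generated tokens, dropping the echoed prompt.
    new_tokens = output_ids[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)
```

Calling `generate_answer("What is 12 * 13?")` downloads the weights on first use, so it needs network access and enough memory for a 1.5B model in BF16.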

Key Capabilities

  • Enhanced Mathematical Reasoning: The model's training with GRPO, a method detailed in the DeepSeekMath paper, suggests a focus on improving mathematical problem-solving and logical deduction.
  • Qwen2.5-Math Foundation: Built upon a model specifically designed for mathematical tasks, it inherits and refines capabilities in this domain.
  • 32K Context Length: Supports processing longer inputs and generating more extensive responses.
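Because prompt and response share the same context window, a long prompt leaves less room for generation. The sketch below shows that arithmetic; it assumes "32K" means 32,768 tokens, and `max_new_tokens_for` is a hypothetical helper name, not something the card defines.

```python
# Assumption: the card's "32K" is taken to mean 32 * 1024 tokens.
CONTEXT_LENGTH = 32 * 1024


def max_new_tokens_for(prompt_tokens: int, context_length: int = CONTEXT_LENGTH) -> int:
    """Return how many tokens can still be generated after a prompt of the given length."""
    if prompt_tokens < 0:
        raise ValueError("prompt_tokens must be non-negative")
    # A prompt that already fills or exceeds the window leaves no room to generate.
    return max(context_length - prompt_tokens, 0)
```

The result can be passed as an upper bound for a generation call's new-token limit, once the prompt has been tokenized and its length is known.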

Training Details

The model was trained with the TRL framework (version 0.16.0.dev0) using GRPO, the method introduced in the paper "DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models". GRPO samples a group of completions for each prompt and reinforces those that score above the group average, with the aim of producing accurate and coherent mathematical solutions.
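The core idea of GRPO is that each sampled completion is judged relative to the other completions for the same prompt, which removes the need for a separate value model. The sketch below illustrates that group-relative advantage computation in plain Python. It is not the training code used for this model: the function name, the use of the population standard deviation, and the `eps` value are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-4):
    """Normalize the rewards of one prompt's completions against their own group.

    Each advantage is (reward - group mean) / (group std + eps).
    eps (an assumed value) avoids division by zero when all rewards are equal.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; implementations may use the sample std instead
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four completions for one prompt: two judged correct (1.0), two incorrect (0.0).
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Correct completions receive a positive advantage and incorrect ones a negative advantage. When every completion in a group earns the same reward, all advantages are zero and that prompt contributes no learning signal.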

Good For

  • Applications requiring strong mathematical reasoning.
  • Tasks involving complex problem-solving and logical inference.
  • Research into reinforcement learning techniques for language models, particularly GRPO.
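For readers who want to experiment with GRPO themselves, the sketch below shows one way a correctness reward could be wired into TRL's `GRPOTrainer`. It is a minimal sketch under stated assumptions, not the recipe used for this model: the dataset is assumed to have a `prompt` column and an `answer` column of reference strings, completions are assumed to be plain strings, and `extract_boxed`, `correctness_reward`, and `build_trainer` are hypothetical names.

```python
import re


def extract_boxed(text):
    """Return the contents of the last \\boxed{...} in text, or None if absent.

    Only handles answers without nested braces, which is enough for a sketch.
    """
    matches = re.findall(r"\\boxed\{([^{}]*)\}", text)
    return matches[-1].strip() if matches else None


def correctness_reward(completions, answer, **kwargs):
    """TRL-style reward function: 1.0 for an exact match with the reference, else 0.0.

    `answer` is assumed to be a dataset column forwarded as a keyword argument.
    """
    return [
        1.0 if extract_boxed(completion) == reference.strip() else 0.0
        for completion, reference in zip(completions, answer)
    ]


def build_trainer(train_dataset):
    # Imported lazily so the reward function can be tested without TRL installed.
    from trl import GRPOConfig, GRPOTrainer

    args = GRPOConfig(output_dir="grpo-sketch")  # hypothetical output directory
    return GRPOTrainer(
        model="Qwen/Qwen2.5-Math-1.5B",  # the base model named on this card
        reward_funcs=correctness_reward,
        args=args,
        train_dataset=train_dataset,
    )
```

Exact string matching is a deliberately simple choice; a real setup would likely normalize answers (for example, treating `0.5` and `1/2` as equal) before comparing.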