zhaohq/PureRL-1.5B-v5-06-mc2

Text Generation | Concurrency Cost: 1 | Model Size: 1.5B | Quant: BF16 | Ctx Length: 32k | Published: May 17, 2026 | Architecture: Transformer

zhaohq/PureRL-1.5B-v5-06-mc2 is a 1.5-billion-parameter language model fine-tuned from Qwen/Qwen2.5-Math-1.5B, with a 32K-token context length. It was trained with GRPO, a reinforcement learning method introduced to improve mathematical reasoning, and is aimed at mathematical problem solving and multi-step logical deduction.


Model Overview

zhaohq/PureRL-1.5B-v5-06-mc2 is a 1.5-billion-parameter language model fine-tuned from Qwen/Qwen2.5-Math-1.5B. It has a context window of 32,768 tokens, which suits longer inputs and complex problem statements. The model was trained with the TRL framework using GRPO (Group Relative Policy Optimization), the reinforcement learning method introduced in the paper "DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models". GRPO samples a group of completions for each prompt and scores each one relative to the rest of its group, so it does not need a separate value model. This training approach targets mathematical reasoning specifically.
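To make the group-relative idea behind GRPO concrete, the sketch below normalizes the rewards of several sampled completions for one prompt against the group's own mean and standard deviation. It is a minimal illustration of the advantage computation described in the DeepSeekMath paper, not this model's training code. The function name `group_relative_advantages`, the `eps` stabilizer, the use of the population standard deviation, and the example rewards are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Score each completion relative to the other completions in its group.

    Hypothetical helper: subtracts the group mean and divides by the group
    standard deviation (population form, an assumption), with a small eps
    to avoid division by zero when every reward is identical.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four hypothetical completions for one math prompt,
# rewarded 1.0 if the final answer is correct and 0.0 otherwise.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)

# Correct completions receive a positive advantage, incorrect ones a negative one.
for reward, advantage in zip(rewards, advantages):
    print(f"reward={reward:.1f} advantage={advantage:+.3f}")
```

When every completion in a group gets the same reward, all advantages come out as zero, so that prompt contributes no learning signal for that step.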

Key Capabilities

  • Enhanced Mathematical Reasoning: Fine-tuned with GRPO, a method designed to improve performance on mathematical tasks.
  • Long Context Understanding: Supports a 32K token context length, allowing for the processing of extensive problem descriptions or multi-step reasoning chains.
  • Qwen2.5-Math Base: Leverages the foundational capabilities of the Qwen2.5-Math-1.5B model, which is inherently strong in mathematical domains.

Good For

  • Mathematical Problem Solving: Intended for applications that need step-by-step mathematical reasoning.
  • Complex Logical Deduction: Suitable for tasks that benefit from processing detailed information and deriving logical conclusions.
  • Research and Development: Provides a base for further experimentation and fine-tuning in mathematical AI applications, particularly those exploring reinforcement learning techniques like GRPO.
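For readers who want to try the model on a math problem, the following is a minimal sketch using the Hugging Face `transformers` library. It assumes the checkpoint loads through `AutoModelForCausalLM` and `AutoTokenizer` like its Qwen2.5-Math base, which this card does not confirm. The helper names `build_prompt` and `generate`, the step-by-step instruction wording, and the default of 512 new tokens are illustrative assumptions. The model ID, BF16 precision, and 32,768-token limit come from this card.

```python
MODEL_ID = "zhaohq/PureRL-1.5B-v5-06-mc2"
MAX_CONTEXT = 32_768  # context length stated on this card


def build_prompt(problem: str) -> str:
    """Append a step-by-step instruction to the problem (wording is an assumption)."""
    instruction = "Please reason step by step, and put your final answer within \\boxed{}."
    return f"{problem.strip()}\n{instruction}"


def generate(problem: str, max_new_tokens: int = 512) -> str:
    """Hypothetical helper: load the model and generate a solution.

    The heavy imports live inside the function so that defining it does not
    require torch or transformers to be installed.
    """
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype=torch.bfloat16)

    inputs = tokenizer(build_prompt(problem), return_tensors="pt")
    prompt_len = inputs["input_ids"].shape[1]

    # Keep prompt plus generated tokens inside the stated context window.
    if prompt_len + max_new_tokens > MAX_CONTEXT:
        raise ValueError("prompt plus max_new_tokens exceeds the context length")

    output = model.generate(**inputs, max_new_tokens=max_new_tokens)
    # Decode only the newly generated tokens, not the echoed prompt.
    return tokenizer.decode(output[0][prompt_len:], skip_special_tokens=True)


if __name__ == "__main__":
    print(generate("What is the sum of the first 10 positive integers?"))
```

Calling `generate` downloads the weights on first use. The plain-text prompt is one reasonable choice. If the tokenizer ships a chat template, formatting the prompt with it may match the training setup more closely.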