zhaohq/PureRL-1.5B-v5-06-umsp

Text Generation · Concurrency Cost: 1 · Model Size: 1.5B · Quant: BF16 · Ctx Length: 32k · Published: May 15, 2026 · Architecture: Transformer · Warm

PureRL-1.5B-v5-06-umsp by zhaohq is a 1.5-billion-parameter language model, fine-tuned from Qwen2.5-Math-1.5B using the TRL framework. It is trained with GRPO, the method introduced in DeepSeekMath, to improve mathematical reasoning. With a 32,768-token context length, the model is designed primarily for mathematical problem-solving and logical deduction.


Overview

zhaohq/PureRL-1.5B-v5-06-umsp is a 1.5-billion-parameter language model, fine-tuned from the Qwen/Qwen2.5-Math-1.5B base model. It was developed by zhaohq and trained with the TRL (Transformer Reinforcement Learning) framework. Its key differentiator is the use of GRPO (Group Relative Policy Optimization), the training method introduced in the research paper "DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models". GRPO scores each sampled answer relative to the other answers sampled for the same prompt, and is applied here to improve the model's mathematical reasoning.
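
To make the "group relative" idea concrete, here is a minimal sketch of the advantage computation at the core of GRPO: the rewards for a group of completions sampled from one prompt are normalized by the group's mean and standard deviation. This is an illustration only, not this model's training code. The function name, the `eps` stabilizer, the use of the population standard deviation, and the example rewards are all assumptions.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize one group's rewards into advantages.

    `rewards` holds one scalar reward per completion sampled for the
    same prompt. `eps` is an assumed stabilizer that avoids division
    by zero when every completion receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical group of four completions: 1.0 = correct final answer, 0.0 = incorrect.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
```

Completions that score above the group average receive a positive advantage and are reinforced; those below receive a negative one. When all completions in a group get the same reward, every advantage is zero and the group contributes no learning signal.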

Key Capabilities

  • Enhanced Mathematical Reasoning: Optimized through the GRPO method, making it suitable for complex mathematical problems.
  • Reinforcement Learning Fine-tuning: Fine-tuned with reinforcement learning using the TRL framework.
  • Qwen2.5-Math Base: Builds upon a strong foundation designed for mathematical understanding.

Good for

  • Applications requiring robust mathematical problem-solving.
  • Research and development in reinforcement learning for language models.
  • Tasks where logical deduction and numerical accuracy are critical.
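
For trying the model on a math problem, the following is a minimal sketch using the Hugging Face `transformers` library. It assumes the weights are available under the repository ID shown above and that `torch` and `transformers` are installed. The prompt wording in `build_prompt`, the helper names, and the `max_new_tokens` default are hypothetical choices, not a format documented for this model.

```python
MODEL_ID = "zhaohq/PureRL-1.5B-v5-06-umsp"


def build_prompt(question: str) -> str:
    """Wrap a question in a plain step-by-step instruction (assumed format)."""
    return (
        "Solve the following problem step by step.\n\n"
        f"Problem: {question.strip()}\n\n"
        "Solution:"
    )


def generate_answer(question: str, max_new_tokens: int = 512) -> str:
    """Load the model and greedily generate a solution for one question."""
    # Imported lazily so the prompt helper can be used without these packages.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    # BF16 matches the quantization listed for this model.
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype=torch.bfloat16)

    inputs = tokenizer(build_prompt(question), return_tensors="pt")
    with torch.no_grad():
        output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)

    # Keep only the newly generated tokens, dropping the echoed prompt.
    new_tokens = output_ids[0, inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)


# Example call (downloads the weights on first use):
# print(generate_answer("What is the sum of the first 10 positive integers?"))
```

The sketch uses a plain-text prompt rather than a chat template because the source does not say which prompt format was used during training; adjust the prompt if the model's own card specifies one.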