zhaohq/PureRL-1.5B-v6d4-lam01-sigmoid-maskoff-acc05

Task: Text Generation | Concurrency Cost: 1 | Model Size: 1.5B | Quant: BF16 | Ctx Length: 32k | Published: May 18, 2026 | Architecture: Transformer

The PureRL-1.5B-v6d4-lam01-sigmoid-maskoff-acc05 model by zhaohq is a 1.5 billion parameter language model, fine-tuned from Qwen/Qwen2.5-Math-1.5B with a 32768 token context length. It was trained with GRPO (Group Relative Policy Optimization), a reinforcement learning method introduced to improve mathematical reasoning in language models. The model targets mathematical problem solving.


Model Overview

The PureRL-1.5B-v6d4-lam01-sigmoid-maskoff-acc05 is a 1.5 billion parameter language model developed by zhaohq. It is a fine-tuned variant of the Qwen/Qwen2.5-Math-1.5B base model, inheriting its 32768 token context length.

Key Capabilities & Training

This model's primary differentiator is its training methodology. It was fine-tuned using GRPO (Group Relative Policy Optimization), a reinforcement learning method introduced in the research paper "DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models" (arXiv:2402.03300). For each prompt, GRPO samples a group of completions and scores each one relative to the others in the same group, which removes the need for a separate value (critic) model. The training aims to improve the model's performance on mathematical reasoning tasks.
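The group-relative scoring at the core of GRPO can be illustrated with a small sketch. This is not the training code used for this model. It only shows the outcome-reward normalization described in the DeepSeekMath paper, where each completion's reward is standardized against the mean and standard deviation of its group. The function name, the `eps` stabilizer, and the use of the population standard deviation are assumptions made for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Standardize rewards within one group of sampled completions.

    Assumptions (not taken from the model card): population standard
    deviation and a small eps to avoid division by zero when every
    completion in the group receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four completions sampled from one math prompt:
# 1.0 if the final answer was judged correct, 0.0 otherwise.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)

# Correct completions get a positive advantage, incorrect ones a negative one.
for r, a in zip(rewards, advantages):
    print(f"reward={r:.1f} advantage={a:+.3f}")
```

When every completion in a group receives the same reward, all advantages are zero and that group contributes no learning signal under this scheme.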

Technical Details

  • Base Model: Qwen/Qwen2.5-Math-1.5B
  • Parameter Count: 1.5 Billion
  • Context Length: 32768 tokens
  • Training Framework: TRL (Transformer Reinforcement Learning)
  • Training Method: GRPO, focused on mathematical reasoning enhancement.
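Given the details listed above, the model can presumably be loaded like any other causal language model on the Hugging Face Hub. The following is a minimal sketch, assuming the `transformers` and `torch` packages are installed and that the repository exposes standard weights and a tokenizer. The prompt wording in `build_prompt` and the generation settings are assumptions for illustration, not a format documented by the model author.

```python
MODEL_ID = "zhaohq/PureRL-1.5B-v6d4-lam01-sigmoid-maskoff-acc05"


def build_prompt(problem: str) -> str:
    # Hypothetical prompt format: a step-by-step instruction commonly used
    # with math-tuned models. The model card does not specify one.
    return (
        problem.strip()
        + "\nPlease reason step by step, and put your final answer within \\boxed{}."
    )


def generate_solution(problem: str, max_new_tokens: int = 512) -> str:
    # Imported lazily so the helper above can be used without these packages.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    # BF16 matches the precision listed for this model.
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype=torch.bfloat16)

    inputs = tokenizer(build_prompt(problem), return_tensors="pt")
    with torch.no_grad():
        output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)

    # Drop the prompt tokens and decode only the newly generated text.
    new_tokens = output_ids[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)


if __name__ == "__main__":
    print(generate_solution("What is the sum of the first 10 positive integers?"))
```

Calling `generate_solution` downloads the weights on first use. Keeping `max_new_tokens` plus the prompt length within the listed context length is left to the caller.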

Recommended Use Cases

This model is intended for applications that require mathematical problem solving and reasoning. Because it was fine-tuned with GRPO on top of a math-specialized base model, it may handle multi-step mathematical queries better than a general-purpose model of the same size, although no benchmark results are reported here. Candidate applications include educational tools and scientific research assistance. Generated solutions should be verified where precise mathematical correctness is critical.