zhaohq/PureRL-1.5B-v6d2-lam01-identity-maskon-acc05

Text Generation · Concurrency Cost: 1 · Model Size: 1.5B · Quant: BF16 · Ctx Length: 32k · Published: May 18, 2026 · Architecture: Transformer · Warm

zhaohq/PureRL-1.5B-v6d2-lam01-identity-maskon-acc05 is a 1.5-billion-parameter language model fine-tuned from Qwen/Qwen2.5-Math-1.5B by zhaohq. It was trained with GRPO, a reinforcement learning method introduced to improve mathematical reasoning, and supports a context length of 32768 tokens. It targets mathematical problem-solving and reasoning tasks.


Model Overview

zhaohq/PureRL-1.5B-v6d2-lam01-identity-maskon-acc05 is a 1.5-billion-parameter language model fine-tuned from the Qwen/Qwen2.5-Math-1.5B base model. The fine-tuning was carried out with the TRL (Transformer Reinforcement Learning) library.

Key Differentiator

This model's primary distinction lies in its training methodology. It was trained with GRPO (Group Relative Policy Optimization), a reinforcement learning technique introduced in the paper "DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models" (arXiv:2402.03300). GRPO samples a group of completions for each prompt and scores each completion relative to the others in its group, which removes the need for a separate value model. The choice of method and base model suggests a focus on complex mathematical reasoning tasks.
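To make the "group relative" idea concrete, here is a minimal sketch of the advantage computation at the core of GRPO: each sampled completion's reward is normalized by the mean and standard deviation of the rewards in its own group. The function name, the use of the population standard deviation, and the `eps` stabilizer are assumptions made for illustration; the card does not describe the exact settings used for this model.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the other completions sampled for the same prompt.

    Assumption: population standard deviation plus a small eps to avoid division by zero.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# One prompt, four sampled completions: two judged correct (1.0), two incorrect (0.0).
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
print(advantages)
```

Correct completions end up with a positive advantage and incorrect ones with a negative advantage. If every completion in a group receives the same reward, all advantages are zero and that group contributes no learning signal.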

Technical Specifications

  • Base Model: Qwen/Qwen2.5-Math-1.5B
  • Parameter Count: 1.5 billion
  • Context Length: 32768 tokens
  • Training Framework: TRL (version 0.16.0.dev0)
  • Training Method: GRPO
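For orientation, a GRPO run on this base model with TRL might be wired up roughly as below. This is a sketch under stated assumptions, not the author's training script: the card does not state the reward function, dataset, or hyperparameters, so `boxed_answer_reward`, `build_trainer`, the `answer` dataset column, and `output_dir` are all hypothetical. Only `GRPOConfig` and `GRPOTrainer` are TRL names.

```python
import re

BASE_MODEL = "Qwen/Qwen2.5-Math-1.5B"  # base model named in the card


def boxed_answer_reward(completions, answer, **kwargs):
    """Hypothetical reward: 1.0 if the last boxed answer matches the reference, else 0.0.

    TRL passes dataset columns to reward functions as keyword arguments;
    the `answer` column is an assumption about the dataset.
    """
    rewards = []
    for completion, reference in zip(completions, answer):
        found = re.findall(r"\\boxed\{([^{}]*)\}", completion)
        predicted = found[-1].strip() if found else None
        rewards.append(1.0 if predicted == str(reference).strip() else 0.0)
    return rewards


def build_trainer(train_dataset, output_dir="grpo-output"):
    """Assemble a GRPOTrainer; `train_dataset` needs a `prompt` column."""
    # Imported lazily so the reward function can be used without TRL installed.
    from trl import GRPOConfig, GRPOTrainer

    args = GRPOConfig(output_dir=output_dir)
    return GRPOTrainer(
        model=BASE_MODEL,
        reward_funcs=boxed_answer_reward,
        args=args,
        train_dataset=train_dataset,
    )
```

Calling `build_trainer(dataset).train()` would start training. The regular expression only handles boxed answers without nested braces, which is enough for an illustration but not for answers such as fractions written in LaTeX.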

Potential Use Cases

Given its specialized training with GRPO and its origin from a math-focused base model, this model is likely well-suited for:

  • Mathematical problem-solving
  • Reasoning tasks requiring logical deduction
  • Applications where enhanced numerical understanding is critical
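For the use cases above, a minimal inference sketch with the Hugging Face `transformers` library could look like the following. The prompt wording in `build_prompt` is an assumption borrowed from common practice with math-tuned models; the card does not specify a prompt format, and the helper names and `max_new_tokens` value are illustrative.

```python
MODEL_ID = "zhaohq/PureRL-1.5B-v6d2-lam01-identity-maskon-acc05"


def build_prompt(question):
    """Assumed prompt format: the question followed by a step-by-step instruction."""
    return f"{question}\nPlease reason step by step, and put your final answer within \\boxed{{}}."


def solve(question, max_new_tokens=512):
    """Download the model and generate a solution (needs transformers and torch)."""
    # Imported lazily so build_prompt can be used without the heavy dependencies.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto")
    inputs = tokenizer(build_prompt(question), return_tensors="pt").to(model.device)
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    # Drop the prompt tokens so only the newly generated text is returned.
    new_tokens = output_ids[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)
```

A call such as `solve("What is the sum of the first 10 positive integers?")` downloads the weights on first use and returns the generated reasoning as a string.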