zhaohq/PureRL-1.5B-v7-s2-l2-kl-w0-b2

Hugging Face
TEXT GENERATIONConcurrency Cost:1Model Size:1.5BQuant:BF16Ctx Length:32kPublished:May 21, 2026Architecture:Transformer0.0K Warm

The zhaohq/PureRL-1.5B-v7-s2-l2-kl-w0-b2 model is a 1.5 billion parameter language model fine-tuned using the GRPO method, which is designed to enhance mathematical reasoning capabilities. Developed by zhaohq, this model leverages techniques from the DeepSeekMath paper. It is optimized for tasks requiring robust reasoning, particularly in mathematical contexts, and was trained using the TRL framework.

Loading preview...

Model Overview

zhaohq/PureRL-1.5B-v7-s2-l2-kl-w0-b2 is a 1.5 billion parameter language model developed by zhaohq. It has been fine-tuned using the GRPO (Gradient-based Reward Policy Optimization) method, a technique introduced in the "DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models" paper. This fine-tuning process aims to significantly improve the model's capabilities in complex reasoning tasks, particularly those involving mathematics.

Key Characteristics

  • Parameter Count: 1.5 billion parameters, offering a balance between performance and computational efficiency.
  • Training Method: Utilizes GRPO, a specialized reinforcement learning technique for enhancing reasoning abilities.
  • Framework: Trained with the TRL (Transformer Reinforcement Learning) library, indicating a focus on advanced fine-tuning strategies.

Potential Use Cases

  • Mathematical Reasoning: Ideal for applications requiring problem-solving, logical deduction, and mathematical understanding.
  • Complex Query Answering: Can be applied to tasks where answers require more than simple retrieval, demanding deeper analytical processing.
  • Research and Development: Suitable for researchers exploring advanced fine-tuning methods and their impact on reasoning capabilities in smaller language models.