zhaohq/PureRL-1.5B-v7-s2-l2-kl-w2-b0

Hugging Face
TEXT GENERATIONConcurrency Cost:1Model Size:1.5BQuant:BF16Ctx Length:32kPublished:May 21, 2026Architecture:Transformer0.0K Warm

The zhaohq/PureRL-1.5B-v7-s2-l2-kl-w2-b0 model is a 1.5 billion parameter language model developed by zhaohq, fine-tuned from PureRL-1.5B-v7-stage1-reasoning. It utilizes the GRPO training method, as introduced in the DeepSeekMath paper, to enhance mathematical reasoning capabilities. With a context length of 32768 tokens, this model is primarily designed for advanced reasoning tasks, particularly in mathematical domains.

Loading preview...

Overview

zhaohq/PureRL-1.5B-v7-s2-l2-kl-w2-b0 is a 1.5 billion parameter language model, fine-tuned by zhaohq from its base model, PureRL-1.5B-v7-stage1-reasoning. This model leverages the GRPO (Generalized Reinforcement Learning from Policy Optimization) training method, a technique highlighted in the DeepSeekMath paper, which focuses on pushing the limits of mathematical reasoning in open language models. It supports a substantial context length of 32768 tokens.

Key Capabilities

  • Enhanced Mathematical Reasoning: Benefits from the GRPO training procedure, making it suitable for tasks requiring advanced logical and mathematical problem-solving.
  • Fine-tuned Performance: Built upon a reasoning-focused base model, further optimized for specific performance characteristics.
  • Extended Context Window: Offers a 32768-token context length, allowing for processing longer inputs and more complex problem descriptions.

Good for

  • Mathematical Problem Solving: Ideal for applications that involve complex mathematical reasoning, logical deduction, and quantitative analysis.
  • Research and Development: Useful for researchers exploring reinforcement learning from human feedback (RLHF) techniques, particularly GRPO, in smaller-scale models.
  • Question Answering: Can be applied to question-answering systems where the questions require deep reasoning rather than simple fact retrieval.