zhaohq/PureRL-1.5B-v7-s2-l2-kl-w2-b1

Hugging Face
TEXT GENERATIONConcurrency Cost:1Model Size:1.5BQuant:BF16Ctx Length:32kPublished:May 21, 2026Architecture:Transformer Warm

zhaohq/PureRL-1.5B-v7-s2-l2-kl-w2-b1 is a 1.5 billion parameter language model developed by zhaohq, fine-tuned from zhaohq/PureRL-1.5B-v7-stage1-reasoning. This model was trained using the GRPO method, as introduced in the DeepSeekMath paper, which focuses on mathematical reasoning. It is optimized for generating responses to complex questions, leveraging its 32768 token context length.

Loading preview...

Model Overview

zhaohq/PureRL-1.5B-v7-s2-l2-kl-w2-b1 is a 1.5 billion parameter language model, building upon the zhaohq/PureRL-1.5B-v7-stage1-reasoning base model. It has been further fine-tuned using the TRL library and specifically employs the GRPO training method.

Key Training Details

  • Base Model: Fine-tuned from zhaohq/PureRL-1.5B-v7-stage1-reasoning.
  • Training Method: Utilizes GRPO (Generalized Reinforcement Learning with Policy Optimization), a technique highlighted in the paper "DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models" (arXiv:2402.03300). This suggests an emphasis on enhancing reasoning capabilities.
  • Frameworks: Trained with TRL 0.16.0.dev0, Transformers 4.48.3, Pytorch 2.5.1+cu124, and Datasets 4.0.0.

Intended Use Cases

This model is suitable for tasks requiring reasoned responses, particularly those benefiting from the GRPO training approach. Its fine-tuning process indicates a focus on generating coherent and logical answers to user prompts, as demonstrated by the quick start example involving a hypothetical reasoning question.