zhaohq/PureRL-1.5B-v7-s2-l2-kl-w2-b2

TEXT GENERATION | Concurrency Cost: 1 | Model Size: 1.5B | Quant: BF16 | Ctx Length: 32k | Published: May 21, 2026 | Architecture: Transformer | Warm

zhaohq/PureRL-1.5B-v7-s2-l2-kl-w2-b2 is a 1.5 billion parameter language model fine-tuned with the TRL framework. Its training uses GRPO, the reinforcement learning method introduced in the DeepSeekMath paper, to strengthen reasoning. The model is therefore aimed at tasks that require mathematical or logical reasoning.


Model Overview

zhaohq/PureRL-1.5B-v7-s2-l2-kl-w2-b2 is a 1.5 billion parameter language model fine-tuned with the TRL (Transformer Reinforcement Learning) framework. Its training applies GRPO (Group Relative Policy Optimization), a method detailed in the research paper "DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models". GRPO samples several completions per prompt and scores each one relative to the others in its group, which removes the need for a separate value model. The choice of method indicates a focus on improving the model's handling of complex reasoning tasks.
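To make the "group relative" part of GRPO concrete, the sketch below computes the normalized advantage for one group of completions sampled from the same prompt: each reward has the group mean subtracted and is divided by the group standard deviation. This is an illustration only, not the training code used for this model. The function name `group_relative_advantages`, the `eps` stabilizer, the use of the population standard deviation, and the example rewards are all assumptions made for the example.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of completions for the same prompt.

    Each advantage is (reward - group mean) / (group std + eps).
    The population standard deviation and eps are illustrative choices.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four sampled answers: 1.0 = correct, 0.0 = incorrect.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)

# Correct answers receive positive advantages, incorrect ones negative.
print(advantages)
```

When every completion in a group earns the same reward, all advantages are zero, so that group contributes no learning signal through the advantage term.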

Key Capabilities

  • Enhanced Reasoning: Trained with GRPO, which suggests optimization for tasks requiring logical and mathematical reasoning.
  • TRL Framework: Fine-tuned with the TRL library, so further reinforcement learning-based fine-tuning or adaptation should be possible with the same tooling.

Good For

  • Mathematical Problem Solving: Because it was trained with GRPO, the method introduced in the DeepSeekMath paper, it is likely well-suited to mathematical reasoning and problem-solving tasks.
  • Complex Logical Queries: May perform effectively on tasks that demand structured logical thought processes.
  • Research and Development: Provides a base for exploring reinforcement learning techniques in language models, particularly for reasoning-intensive applications.