zhaohq/PureRL-1.5B-v7-s2-l2-kl-w3-b0
Text Generation | Concurrency Cost: 1 | Model Size: 1.5B | Quant: BF16 | Ctx Length: 32k | Published: May 21, 2026 | Architecture: Transformer
The zhaohq/PureRL-1.5B-v7-s2-l2-kl-w3-b0 model is a 1.5 billion parameter language model developed by zhaohq, fine-tuned from zhaohq/PureRL-1.5B-v7-stage1-reasoning. It was trained with the TRL framework using GRPO, a reinforcement learning method introduced to improve mathematical reasoning. The model targets tasks that require mathematical reasoning, building on its stage1 reasoning foundation.
Model Overview
zhaohq/PureRL-1.5B-v7-s2-l2-kl-w3-b0 is a 1.5 billion parameter language model developed by zhaohq. It is fine-tuned from the zhaohq/PureRL-1.5B-v7-stage1-reasoning model, so it continues that checkpoint's focus on reasoning.
Key Capabilities & Training
- Enhanced Reasoning: The model's training incorporates the GRPO (Group Relative Policy Optimization) method, as introduced in the research paper "DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models" (arXiv:2402.03300). This suggests an emphasis on improving mathematical and general reasoning performance.
- Fine-tuned with TRL: It was trained using TRL (Transformer Reinforcement Learning), a Hugging Face library for post-training language models with reinforcement learning and related methods.
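To make the "group relative" part of GRPO concrete, here is a minimal sketch of the outcome-reward advantage described in the DeepSeekMath paper: several completions are sampled for the same prompt, and each completion's reward is normalized by the mean and standard deviation of its group. The function name, the example rewards, the `eps` stabilizer, and the choice of population standard deviation are illustrative assumptions; this is not the training code used for this model.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-4):
    """Normalize each reward against its own sampling group.

    `rewards` holds one scalar reward per completion sampled for a
    single prompt. `eps` is an assumed small constant that avoids
    division by zero when every completion gets the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population standard deviation
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical group of four completions: two correct (1.0), two wrong (0.0).
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
```

Completions that beat their group's average get a positive advantage and the rest get a negative one, so no separate value model is needed to form a baseline.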
Good For
- Mathematical Reasoning Tasks: Given its foundation and the application of the GRPO method, this model is particularly suited for applications requiring robust mathematical problem-solving and reasoning.
- Research and Development: Developers and researchers interested in exploring models fine-tuned with advanced reinforcement learning techniques like GRPO for reasoning tasks may find this model valuable.
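For readers who want to try the model on a math problem, the following is a minimal sketch, assuming the checkpoint is available under this identifier on the Hugging Face Hub and loads with the standard `transformers` causal language model classes. The `solve` helper, the plain-text prompt, and the `max_new_tokens` default are hypothetical choices, not documented usage from the model author.

```python
MODEL_ID = "zhaohq/PureRL-1.5B-v7-s2-l2-kl-w3-b0"


def solve(problem: str, max_new_tokens: int = 512) -> str:
    """Generate a completion for a math problem (hypothetical helper)."""
    # Imported lazily so this file can be imported without torch installed.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    # BF16 matches the precision listed for this model.
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID, torch_dtype=torch.bfloat16
    )

    # Assumption: a plain-text prompt; the expected prompt format is not documented here.
    inputs = tokenizer(problem, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)

    # Keep only the newly generated tokens, dropping the echoed prompt.
    new_tokens = output_ids[0, inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)
```

Calling `solve("What is 17 * 23? Show your reasoning.")` downloads the weights on first use and returns the generated text.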