zhaohq/PureRL-1.5B-v12D-lam025
Text Generation | Concurrency Cost: 1 | Model Size: 1.5B | Quant: BF16 | Ctx Length: 32k | Published: May 19, 2026 | Architecture: Transformer
PureRL-1.5B-v12D-lam025 is a 1.5 billion parameter language model developed by zhaohq, fine-tuned from Qwen/Qwen2.5-Math-1.5B. It was trained with GRPO, the method introduced in the DeepSeekMath paper. With a 32768-token context length, it targets mathematical reasoning and multi-step problem solving.
Overview
PureRL-1.5B-v12D-lam025 is a 1.5 billion parameter language model developed by zhaohq. It is a fine-tuned version of the Qwen/Qwen2.5-Math-1.5B base model, trained with the TRL (Transformer Reinforcement Learning) library.
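As a minimal sketch of how such a model might be loaded, the snippet below uses the Hugging Face `transformers` library. It assumes the weights are available on the Hugging Face Hub under the repository name shown on this page, that `transformers` and `torch` are installed, and that a plain-text prompt is acceptable; the card does not document a prompt format, and the helper name `generate_answer` is purely illustrative.

```python
# Repository name as it appears on this page; Hub availability is an assumption.
MODEL_ID = "zhaohq/PureRL-1.5B-v12D-lam025"


def generate_answer(question, max_new_tokens=512):
    """Hypothetical helper: generate a completion for a plain-text question."""
    # Imported lazily so that defining the function does not require the libraries.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    # BF16 matches the quantization listed for this model.
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype=torch.bfloat16)

    # Plain prompt, no chat template (an assumption; the card does not specify one).
    inputs = tokenizer(question, return_tensors="pt")
    with torch.no_grad():
        output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)

    # Keep only the newly generated tokens, dropping the echoed prompt.
    new_tokens = output_ids[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)
```

Calling `generate_answer("Solve for x: 2x + 3 = 11.")` would download the weights on first use and return the generated text. No output is shown here because none was measured for this card.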
Key Capabilities
- Mathematical Reasoning: Training uses GRPO (Group Relative Policy Optimization), the reinforcement learning method introduced in the DeepSeekMath paper to improve mathematical reasoning in open language models.
- Reinforcement Learning Fine-tuning: Trained with the TRL library, which may improve its ability to follow instructions and produce coherent, task-specific responses.
- Context Length: Supports a substantial context window of 32768 tokens, allowing it to process and generate longer, more complex sequences of text.
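The core idea behind GRPO can be illustrated with a short sketch: several completions are sampled for the same prompt, each receives a scalar reward, and each completion's advantage is its reward normalized against the group's mean and standard deviation. The code below is an illustration only, not this model's training code. The 0/1 correctness rewards, the function name, and the use of the population standard deviation are assumptions.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its own sampling group.

    rewards: scalar rewards for completions sampled from ONE prompt.
    eps: small constant to avoid division by zero when all rewards are equal.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; implementations may differ here
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical group of four sampled answers: 1.0 = correct, 0.0 = incorrect.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)

# Correct answers get a positive advantage, incorrect ones a negative one.
# If every answer in a group earns the same reward, all advantages are zero.
uniform = group_relative_advantages([1.0, 1.0, 1.0])
```

Because the baseline comes from the group itself, this formulation does not need a separately trained value model, which is one reason it is practical for small models.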
Good For
- Applications requiring advanced mathematical problem-solving.
- Tasks benefiting from models fine-tuned with reinforcement learning techniques.
- Scenarios where a 1.5 billion parameter model with a large context window is suitable for balancing performance and computational resources.
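To make the resource trade-off concrete, here is a back-of-the-envelope estimate of the memory needed just to hold the weights, using the nominal 1.5B parameter count and the BF16 format (2 bytes per parameter) listed for this model. It is a rough sketch: the exact parameter count may differ from the nominal figure, and activations and the KV cache for long contexts are not included.

```python
def weight_memory_gib(n_params, bytes_per_param):
    """Approximate memory for model weights alone, in GiB."""
    return n_params * bytes_per_param / 2**30


NOMINAL_PARAMS = 1.5e9   # nominal size from the listing, not an exact count
BF16_BYTES = 2           # bfloat16 stores each parameter in 16 bits

bf16_gib = weight_memory_gib(NOMINAL_PARAMS, BF16_BYTES)
print(f"Approximate BF16 weight memory: {bf16_gib:.2f} GiB")
```

Under these assumptions the weights alone take a little under 3 GiB, so the practical memory requirement at long context lengths will be higher.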