zhaohq/PureRL-1.5B-v6g-A-lam01-sigmoid-maskoff
Text Generation | Concurrency Cost: 1 | Model Size: 1.5B | Quant: BF16 | Ctx Length: 32k | Published: May 19, 2026 | Architecture: Transformer | Warm
The zhaohq/PureRL-1.5B-v6g-A-lam01-sigmoid-maskoff model is a 1.5-billion-parameter language model fine-tuned from Qwen/Qwen2.5-Math-1.5B, with a 32,768-token context length. Developed by zhaohq, it was trained with the GRPO method to improve mathematical reasoning, and it targets mathematical problem-solving and logical deduction tasks.
Model Overview
This model, PureRL-1.5B-v6g-A-lam01-sigmoid-maskoff, is a 1.5-billion-parameter language model developed by zhaohq. It is fine-tuned from the Qwen/Qwen2.5-Math-1.5B base model and has a 32,768-token context length.
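As a rough illustration of how a checkpoint like this is typically loaded, here is a minimal sketch using the Hugging Face transformers library. It assumes the weights are available under the repository ID shown on this page and that a plain-text prompt is acceptable. The `build_prompt` helper and its step-by-step instruction are hypothetical, since the card does not specify a prompt format.

```python
MODEL_ID = "zhaohq/PureRL-1.5B-v6g-A-lam01-sigmoid-maskoff"


def build_prompt(question: str) -> str:
    """Hypothetical prompt format; the model card does not document one."""
    instruction = "Please reason step by step, and put your final answer within \\boxed{}."
    return f"{question.strip()}\n{instruction}"


def generate_answer(question: str, max_new_tokens: int = 512) -> str:
    """Load the model and generate a completion for one question."""
    # Imported lazily so the helper above can be used without torch installed.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    # BF16 matches the quantization listed on this page.
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype=torch.bfloat16)

    inputs = tokenizer(build_prompt(question), return_tensors="pt")
    with torch.no_grad():
        output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)

    # Drop the prompt tokens so only the newly generated text is returned.
    new_tokens = output_ids[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)
```

Calling `generate_answer("What is the sum of the first 10 positive integers?")` downloads the weights on first use. The generation settings here are placeholders, not recommendations from the model author.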
Key Capabilities
- Enhanced Mathematical Reasoning: The model was trained using the GRPO (Group Relative Policy Optimization) method, as introduced in the DeepSeekMath paper, specifically to improve its mathematical problem-solving abilities.
- Fine-tuned with TRL: Training was conducted using the TRL library, a framework for Transformer Reinforcement Learning.
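
The core idea behind GRPO can be sketched in a few lines: several completions are sampled for the same prompt, each is scored, and each reward is normalized against the other rewards in its group rather than against a learned value function. The snippet below is only an illustrative sketch of that normalization step, not the author's training code. The function name, the `eps` value, and the use of the population standard deviation are assumptions.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of completions for the same prompt.

    Each advantage is (reward - group mean) / (group std + eps).
    `eps` (an assumed value) avoids division by zero when all rewards match.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption for this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Example: four sampled answers to one math problem, scored 1.0 if correct.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
# Correct answers get a positive advantage, incorrect ones a negative one.
```

When every completion in a group receives the same reward, all advantages are zero, so that group contributes no learning signal in this simplified view.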
Good For
- Applications requiring strong mathematical reasoning and logical deduction.
- Tasks where a smaller, specialized model with a long context window is beneficial for mathematical or complex problem-solving.