zhaohq/PureRL-1.5B-v9E-digit-w050
The zhaohq/PureRL-1.5B-v9E-digit-w050 model is a 1.5 billion parameter language model fine-tuned from Qwen/Qwen2.5-Math-1.5B. Developed by zhaohq, it was trained with GRPO, a reinforcement learning method aimed at improving mathematical reasoning. With a context length of 32768 tokens, it targets mathematical problem-solving and reasoning tasks.
Model Overview
zhaohq/PureRL-1.5B-v9E-digit-w050 is a 1.5 billion parameter language model fine-tuned from Qwen/Qwen2.5-Math-1.5B. It was trained with the TRL library using GRPO (Group Relative Policy Optimization), the method introduced in "DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models." GRPO samples a group of completions for each prompt and scores each one relative to the rest of its group, which makes it a natural fit for math problems whose answers can be checked automatically.
Key Capabilities
- Enhanced Mathematical Reasoning: Fine-tuned with GRPO, a method designed to improve performance on mathematical tasks.
- Qwen2.5 Base: Leverages the robust foundation of the Qwen2.5-Math-1.5B model.
- TRL Framework: Utilizes the Transformer Reinforcement Learning (TRL) library for its training procedure.
- Large Context Window: Supports a context length of 32768 tokens, allowing for processing longer inputs and more complex problems.
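A minimal sketch of loading the model for inference, assuming it is available on the Hugging Face Hub under the ID above and works with the standard `transformers` causal-LM classes. The prompt wording, the helper names `build_prompt` and `generate_answer`, and the generation settings are illustrative assumptions, not taken from the model card.

```python
MODEL_ID = "zhaohq/PureRL-1.5B-v9E-digit-w050"


def build_prompt(question: str) -> str:
    """Wrap a question in a plain-text prompt (hypothetical format, not from the card)."""
    return f"Question: {question.strip()}\nPlease reason step by step.\nAnswer:"


def generate_answer(question: str, max_new_tokens: int = 512) -> str:
    """Load the model and return the generated continuation for one question."""
    # Imported lazily so build_prompt can be used without transformers installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID)

    inputs = tokenizer(build_prompt(question), return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)

    # Drop the prompt tokens so only the newly generated text is returned.
    new_tokens = output_ids[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)
```

Calling `generate_answer("What is 17 * 23?")` downloads the weights on first use. The model is reloaded on every call here for brevity; a real application would load it once and reuse it.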
Training Details
The model was trained with GRPO, the method described in the DeepSeekMath paper, using the TRL library. Training runs were tracked with Weights & Biases.
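The core idea of GRPO can be illustrated with a short sketch of its group-relative advantage: each sampled completion's reward is normalized by the mean and standard deviation of the rewards in its group. This is an illustration under stated assumptions, not this model's training code. The card does not describe the reward function, so the binary correctness rewards below are hypothetical, and the choice of population standard deviation and the `eps` value are assumptions that may differ from TRL's implementation.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against the other completions for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population standard deviation
    # eps avoids division by zero when every completion gets the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four completions of one prompt:
# 1.0 = final answer correct, 0.0 = incorrect.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
```

Completions that score above the group mean get a positive advantage and are reinforced, while those below it get a negative one. When every completion in a group receives the same reward, all advantages are zero and that prompt contributes no learning signal.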