zhaohq/PureRL-1.5B-v11C-lam010
zhaohq/PureRL-1.5B-v11C-lam010 is a 1.5-billion-parameter language model fine-tuned from Qwen/Qwen2.5-Math-1.5B. Developed by zhaohq, it was trained with GRPO, the reinforcement learning method introduced in the DeepSeekMath paper, to improve mathematical reasoning. It has a context length of 32768 tokens and targets mathematical problem-solving and logical deduction.
Model Overview
zhaohq/PureRL-1.5B-v11C-lam010 is a 1.5-billion-parameter language model built on the Qwen/Qwen2.5-Math-1.5B base model. It was fine-tuned with the TRL framework using GRPO (Group Relative Policy Optimization).
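As a rough illustration of how a checkpoint like this is typically loaded, the sketch below uses the standard Transformers `AutoTokenizer` and `AutoModelForCausalLM` classes. The prompt wording, the helper names `build_prompt` and `solve`, and the generation settings are assumptions for illustration only; the model card does not specify a prompt format.

```python
MODEL_ID = "zhaohq/PureRL-1.5B-v11C-lam010"


def build_prompt(problem: str) -> str:
    """Wrap a math problem in a plain-text instruction.

    Assumption: the card does not document a prompt format, so this
    step-by-step / boxed-answer wording is only a guess.
    """
    return (
        f"{problem}\n"
        "Please reason step by step, and put your final answer within \\boxed{}."
    )


def solve(problem: str, max_new_tokens: int = 512) -> str:
    """Download the model and generate a completion for one problem."""
    # Imported lazily so build_prompt can be used without transformers installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID)

    inputs = tokenizer(build_prompt(problem), return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)

    # Drop the prompt tokens and decode only the newly generated ones.
    new_tokens = output_ids[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)
```

Calling `solve("What is the sum of the first 10 positive integers?")` downloads the weights on first use; the quality of the completion depends on how closely the assumed prompt matches the one used in training.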
Key Capabilities
- Enhanced Mathematical Reasoning: Trained with GRPO, the technique detailed in the DeepSeekMath paper, to improve mathematical reasoning.
- Reinforcement Learning Fine-tuning: The training procedure uses the TRL library to fine-tune the base model with reinforcement learning.
- Qwen2.5-Math Base: Benefits from the strong mathematical foundation of its base model, Qwen/Qwen2.5-Math-1.5B.
Training Details
The model was trained with GRPO, a method described in the paper "DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models" (arXiv:2402.03300), which targets complex mathematical problems and multi-step reasoning. Training used TRL 0.16.0.dev0, Transformers 4.48.3, PyTorch 2.5.1, Datasets 4.0.0, and Tokenizers 0.21.1.
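To make the core idea of GRPO concrete: for each prompt, a group of completions is sampled and scored, and each completion's advantage is its reward normalized against the group's mean and standard deviation. The sketch below shows only that normalization step in plain Python. It is not this model's training code; the function name, the `eps` value, the use of the population standard deviation, and the example rewards are all assumptions.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-4):
    """Normalize each reward against its own group of sampled completions.

    Assumptions: population standard deviation and a small eps to avoid
    division by zero when every completion gets the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four completions of one math prompt:
# 1.0 if the final answer was correct, 0.0 otherwise.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
```

Correct completions end up with positive advantages and incorrect ones with negative advantages. When every completion in a group receives the same reward, all advantages are zero and that group contributes no learning signal.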
Good For
- Applications requiring strong mathematical problem-solving.
- Tasks involving logical reasoning and deduction.
- Research into reinforcement learning applications for language models.