zhaohq/PureRL-1.5B-v7-s2-l2-kl-w3-b2
zhaohq/PureRL-1.5B-v7-s2-l2-kl-w3-b2 is a 1.5 billion parameter language model fine-tuned with the TRL library. It was trained with GRPO (Group Relative Policy Optimization), the reinforcement learning method introduced in the DeepSeekMath paper to improve mathematical reasoning in open language models. The choice of training method suggests a focus on mathematical and logical problem solving.
Model Overview
zhaohq/PureRL-1.5B-v7-s2-l2-kl-w3-b2 is a 1.5 billion parameter language model fine-tuned using the TRL (Transformer Reinforcement Learning) library. Its training procedure applies GRPO (Group Relative Policy Optimization), the reinforcement learning method introduced in the DeepSeekMath paper. GRPO samples a group of completions for each prompt and scores each completion relative to the rest of its group, which removes the need for a separate value model. This points to a focus on improving the model's ability to handle complex mathematical reasoning tasks.
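To make the "group relative" idea concrete, here is a minimal sketch of how a per-completion advantage can be computed from the rewards of one group. The function name, the binary reward scheme, the `eps` value, and the use of the population standard deviation are all assumptions for illustration; they are not taken from this model's training configuration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-4):
    """Normalize each reward against the other completions for the same prompt."""
    baseline = mean(rewards)   # group mean acts as the baseline
    spread = pstdev(rewards)   # group standard deviation sets the scale
    # eps keeps the division defined when every completion gets the same reward.
    return [(r - baseline) / (spread + eps) for r in rewards]


# Four sampled answers to one prompt, scored 1.0 if correct and 0.0 otherwise
# (a hypothetical reward scheme, not one documented for this model).
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
print(advantages)
```

Completions that beat their group's average receive a positive advantage and are reinforced; those below it are discouraged. When every completion in a group earns the same reward, all advantages are zero and that group contributes no learning signal.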
Key Training Details
- Fine-tuning Framework: TRL (version 0.16.0.dev0)
- Optimization Method: GRPO, as described in the "DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models" paper.
- Framework Versions: Transformers 4.48.3, PyTorch 2.5.1, Datasets 4.0.0, and Tokenizers 0.21.1.
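When trying to reproduce the training environment, it can help to compare the locally installed packages against the versions listed above. The following is a small sketch using only the standard library; the PyPI distribution names (for example `torch` for PyTorch) and the helper name `find_mismatches` are assumptions.

```python
from importlib.metadata import PackageNotFoundError, version

# Versions as listed in the training details. The distribution names on the
# left are assumptions about how each library is installed from PyPI.
EXPECTED = {
    "trl": "0.16.0.dev0",
    "transformers": "4.48.3",
    "torch": "2.5.1",
    "datasets": "4.0.0",
    "tokenizers": "0.21.1",
}


def find_mismatches(expected):
    """Return {package: (expected, installed)} for packages that differ or are missing."""
    mismatches = {}
    for name, wanted in expected.items():
        try:
            # Drop local build suffixes such as "+cu121" before comparing.
            installed = version(name).split("+")[0]
        except PackageNotFoundError:
            installed = None
        if installed != wanted:
            mismatches[name] = (wanted, installed)
    return mismatches


for name, (wanted, installed) in find_mismatches(EXPECTED).items():
    print(f"{name}: card lists {wanted}, found {installed or 'not installed'}")
```

A mismatch does not necessarily prevent the model from loading; the check only flags where the local environment differs from the one reported for training.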
Potential Use Cases
- Mathematical Reasoning: Due to its GRPO training, this model is likely well-suited for tasks involving mathematical problem-solving and logical deduction.
- Research and Development: Useful for researchers exploring reinforcement learning techniques in language model fine-tuning, particularly those interested in GRPO's application.
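For trying the model on a math problem, a minimal sketch with the Transformers `pipeline` API might look like the following. The prompt wording, the helper names, and the `max_new_tokens` default are assumptions; the expected prompt or chat format for this model is not documented here, so results may improve with a different template.

```python
MODEL_ID = "zhaohq/PureRL-1.5B-v7-s2-l2-kl-w3-b2"


def build_prompt(question):
    """Wrap a question in a simple step-by-step instruction (hypothetical format)."""
    return (
        "Solve the following problem step by step.\n\n"
        f"Problem: {question}\n\nSolution:"
    )


def generate_answer(question, max_new_tokens=512):
    """Generate a completion; downloads the model weights on first use."""
    # Imported lazily so build_prompt stays usable without transformers installed.
    from transformers import pipeline

    generator = pipeline("text-generation", model=MODEL_ID)
    result = generator(build_prompt(question), max_new_tokens=max_new_tokens)
    # By default the returned text includes the prompt followed by the completion.
    return result[0]["generated_text"]
```

Calling `generate_answer("What is the sum of the first 10 positive integers?")` would fetch the weights from the hub and return the prompt followed by the model's continuation.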