zhaohq/PureRL-1.5B-v7-s2-l1-maskon
The zhaohq/PureRL-1.5B-v7-s2-l1-maskon model is a 1.5 billion parameter language model fine-tuned using the GRPO method, as introduced in the DeepSeekMath paper. This model specializes in enhancing mathematical reasoning capabilities in open language models. It is built upon an unspecified base model and trained with TRL, offering a focused approach to improving reasoning tasks.
Loading preview...
Model Overview
The zhaohq/PureRL-1.5B-v7-s2-l1-maskon is a 1.5 billion parameter language model developed by zhaohq. It is a fine-tuned version of an unspecified base model, leveraging the Transformer Reinforcement Learning (TRL) framework for its training.
Key Capabilities and Training
The primary differentiator of this model is its training procedure, which utilizes GRPO (Generalized Reinforcement Learning with Policy Optimization). This method was introduced in the research paper "DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models" (arXiv:2402.03300). The application of GRPO suggests an optimization for complex reasoning tasks, particularly in mathematical domains, aiming to improve the model's ability to process and generate logically sound responses.
Technical Details
- Parameters: 1.5 Billion
- Context Length: 32768 tokens
- Training Frameworks: TRL (version 0.16.0.dev0), Transformers (version 4.48.3), Pytorch (version 2.5.1), Datasets (version 4.0.0), Tokenizers (version 0.21.1).
Use Cases
This model is particularly suited for applications requiring enhanced mathematical reasoning and complex problem-solving. Its fine-tuning with GRPO indicates a focus on improving the logical coherence and accuracy of generated text in analytical contexts. Developers can integrate this model for tasks where robust reasoning capabilities are crucial, potentially outperforming general-purpose models in specific analytical domains.