zhaohq/PureRL-1.5B-v7-s2-corr-maskoff
zhaohq/PureRL-1.5B-v7-s2-corr-maskoff is a 1.5-billion-parameter language model with a 32768-token context length, fine-tuned with the TRL library. It was trained with GRPO, a reinforcement learning method introduced to improve mathematical reasoning, and is intended for mathematical problem-solving and logical deduction tasks.
Model Overview
zhaohq/PureRL-1.5B-v7-s2-corr-maskoff is a 1.5-billion-parameter language model fine-tuned with the TRL (Transformer Reinforcement Learning) library. Its context length is 32768 tokens, which lets it accept long inputs.
Key Differentiator: GRPO Training
The model was trained with GRPO (Group Relative Policy Optimization), a method introduced in the paper "DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models" (arXiv:2402.03300). GRPO samples a group of responses for each prompt and scores every response relative to the others in its group, which removes the need for a separate value model. The method was introduced to improve mathematical reasoning.
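The group-relative scoring behind GRPO can be illustrated with a minimal sketch. It assumes the outcome-reward form in which each reward is normalized by the mean and standard deviation of its own group; the function name, the `eps` stabilizer, and the use of the population standard deviation are illustrative assumptions, not details taken from this model's training code.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against its own sampling group.

    `rewards` holds one scalar reward per response sampled for the same prompt.
    `eps` is an assumed stabilizer for groups whose rewards are all equal.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical group: four sampled answers to one problem, reward 1.0 if correct.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Correct answers receive positive advantages and incorrect ones negative advantages, while a group in which every response earns the same reward contributes no learning signal.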
Capabilities
- Enhanced Mathematical Reasoning: Trained with GRPO to target multi-step mathematical problem-solving.
- Long Context Understanding: Accepts up to 32768 tokens, leaving room for longer prompts and documents.
- TRL Framework: Fine-tuned with the TRL library using reinforcement learning (GRPO).
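Because the prompt and the generated answer share the same 32768-token window, it can help to compute the remaining generation budget before calling the model. The sketch below is plain arithmetic; the helper name and the `reserve` margin are hypothetical.

```python
CONTEXT_LENGTH = 32768  # context length stated on this card


def max_new_tokens_for(prompt_tokens, reserve=0):
    """Return how many tokens can still be generated after the prompt.

    `reserve` is a hypothetical safety margin, not a model setting.
    """
    remaining = CONTEXT_LENGTH - prompt_tokens - reserve
    if remaining <= 0:
        raise ValueError("prompt does not fit in the context window")
    return remaining
```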
Recommended Use Cases
This model is particularly well-suited for applications requiring:
- Solving mathematical problems and equations.
- Logical deduction and reasoning tasks.
- Processing and generating text where mathematical understanding is crucial.
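For these use cases, a minimal loading sketch with the Transformers library might look as follows. It assumes the repository can be loaded through `AutoModelForCausalLM`; the prompt wording and the `build_prompt` and `solve` helpers are illustrative, since this card does not specify a prompt template.

```python
MODEL_ID = "zhaohq/PureRL-1.5B-v7-s2-corr-maskoff"


def build_prompt(problem):
    """Hypothetical plain-text prompt; the card does not specify a template."""
    return (
        "Solve the following problem step by step.\n\n"
        f"Problem: {problem}\nSolution:"
    )


def solve(problem, max_new_tokens=512):
    # Imported lazily so build_prompt works without transformers installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID)
    inputs = tokenizer(build_prompt(problem), return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    # Decode only the newly generated tokens, not the echoed prompt.
    new_tokens = output_ids[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)
```

Calling `solve("What is the sum of the first 10 positive integers?")` downloads the weights on first use and returns the generated continuation as a string.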
Training Environment
The model was trained with the following framework versions:
- TRL: 0.16.0.dev0
- Transformers: 4.48.3
- PyTorch: 2.5.1
- Datasets: 4.0.0
- Tokenizers: 0.21.1
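To compare a local environment against the versions listed above, a small check built on the standard library's `importlib.metadata` could be used. The `compare_with_card` helper is hypothetical, and the distribution names assume the usual PyPI packages (`torch` for PyTorch).

```python
from importlib.metadata import PackageNotFoundError, version

# Versions listed on this card, keyed by PyPI distribution name.
CARD_VERSIONS = {
    "trl": "0.16.0.dev0",
    "transformers": "4.48.3",
    "torch": "2.5.1",
    "datasets": "4.0.0",
    "tokenizers": "0.21.1",
}


def compare_with_card(expected=CARD_VERSIONS):
    """Map each package to (installed_version_or_None, matches_card)."""
    report = {}
    for name, wanted in expected.items():
        try:
            installed = version(name)
        except PackageNotFoundError:
            installed = None
        # Ignore local build suffixes such as "+cu121" when comparing.
        base = installed.split("+")[0] if installed else None
        report[name] = (installed, base == wanted)
    return report
```

A mismatch does not necessarily prevent inference; the listed versions describe the training setup rather than a hard requirement.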