zhaohq/PureRL-1.5B-v7-s2-l2-kl-w3-b1
zhaohq/PureRL-1.5B-v7-s2-l2-kl-w3-b1 is a 1.5 billion parameter language model developed by zhaohq and fine-tuned from zhaohq/PureRL-1.5B-v7-stage1-reasoning. It was trained with GRPO, the method introduced in the DeepSeekMath paper. The model targets general text generation and supports a 32768-token context length. The use of GRPO suggests an emphasis on reasoning, particularly mathematical reasoning.
Model Overview
zhaohq/PureRL-1.5B-v7-s2-l2-kl-w3-b1 is a 1.5 billion parameter language model developed by zhaohq. It is fine-tuned from the zhaohq/PureRL-1.5B-v7-stage1-reasoning model and has a 32768-token context window.
Key Training Details
The model was trained with GRPO (Group Relative Policy Optimization). This method was introduced in the research paper "DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models" (arXiv:2402.03300). Instead of relying on a separate value model, GRPO samples a group of completions for each prompt and scores each completion relative to the others in its group. The use of GRPO suggests a focus on improving the model's reasoning abilities, which may make it more robust on complex problem-solving tasks.
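To make the group-relative idea concrete, here is a minimal sketch of the advantage computation used with outcome rewards in GRPO: each completion's reward is normalized by the mean and standard deviation of its group. The function name, the use of the population standard deviation, and the `eps` value are illustrative assumptions, not details taken from this model's training configuration.

```python
from statistics import fmean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of completions for the same prompt.

    Hypothetical helper: the population standard deviation and eps are
    assumptions made for this sketch, not this model's actual settings.
    """
    if not rewards:
        raise ValueError("rewards must contain at least one value")
    mean = fmean(rewards)
    std = pstdev(rewards)
    # eps avoids division by zero when every completion gets the same reward.
    return [(r - mean) / (std + eps) for r in rewards]


# Four sampled completions for one prompt, rewarded 1.0 if correct, else 0.0.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Completions that score above the group mean receive positive advantages and the rest receive negative ones. When all rewards in a group are equal, every advantage is zero, so that group contributes no learning signal.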
Training was conducted with the TRL library, using the following framework versions: TRL 0.16.0.dev0, Transformers 4.48.3, PyTorch 2.5.1+cu124, Datasets 4.0.0, and Tokenizers 0.21.1.
Potential Use Cases
Given its fine-tuning with a method aimed at improving reasoning, this model could be particularly suitable for:
- General text generation: Capable of generating coherent and contextually relevant responses.
- Reasoning-intensive tasks: Potentially performs well on tasks requiring logical deduction or problem-solving, especially those with a mathematical or structured reasoning component, due to its GRPO training.
- Long-context applications: Its 32768-token context length leaves room for long prompts and long generations within a single request.
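The use cases above can be tried with a short generation script. This is a minimal sketch, assuming the checkpoint loads through the standard Transformers `AutoTokenizer` and `AutoModelForCausalLM` classes, which the page does not confirm. `remaining_token_budget` and `generate` are hypothetical helper names, and the default of 512 new tokens is arbitrary.

```python
CONTEXT_LENGTH = 32768  # context length stated for this model
MODEL_ID = "zhaohq/PureRL-1.5B-v7-s2-l2-kl-w3-b1"


def remaining_token_budget(prompt_tokens, context_length=CONTEXT_LENGTH):
    """Return how many tokens are left for generation (hypothetical helper)."""
    if prompt_tokens < 0:
        raise ValueError("prompt_tokens must be non-negative")
    return max(context_length - prompt_tokens, 0)


def generate(prompt, max_new_tokens=512, model_id=MODEL_ID):
    """Generate a completion, capped so prompt plus output fit the context."""
    # Imported lazily so the budget helper works without transformers installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    # Assumption: the repository loads with the standard Auto classes.
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id)

    inputs = tokenizer(prompt, return_tensors="pt")
    prompt_tokens = inputs["input_ids"].shape[1]
    budget = remaining_token_budget(prompt_tokens)
    if budget == 0:
        raise ValueError("prompt already fills the context window")

    output_ids = model.generate(
        **inputs, max_new_tokens=min(max_new_tokens, budget)
    )
    # Drop the prompt tokens and decode only the newly generated part.
    return tokenizer.decode(output_ids[0][prompt_tokens:], skip_special_tokens=True)


if __name__ == "__main__":
    print(generate("What is 17 * 24? Explain your reasoning step by step."))
```

Capping `max_new_tokens` by the remaining budget keeps the prompt and the output together within the 32768-token limit. Reasoning-style outputs can be long, so a larger cap may be needed in practice.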