zhaohq/PureRL-1.5B-v7-s2-l1-maskoff
zhaohq/PureRL-1.5B-v7-s2-l1-maskoff is a 1.5-billion-parameter language model fine-tuned by zhaohq. It was trained with GRPO, the method introduced in the DeepSeekMath paper, and targets tasks that require mathematical and logical reasoning. Its context length of 32768 tokens leaves room for long inputs in reasoning-heavy applications.
Model Overview
zhaohq/PureRL-1.5B-v7-s2-l1-maskoff is a 1.5-billion-parameter language model developed by zhaohq. It was fine-tuned with the TRL framework using GRPO (Group Relative Policy Optimization). GRPO, introduced in the paper "DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models", is a reinforcement learning method that scores each sampled answer relative to the other answers sampled for the same prompt, and it was proposed to improve mathematical reasoning.
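To make the "group relative" idea concrete, the sketch below computes the advantage that GRPO assigns to each sampled answer: its reward minus the group mean, divided by the group standard deviation. This is a minimal illustration of that one step, not the training code used for this model. The reward values, the `eps` term, and the choice of population standard deviation are assumptions made for the example.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its own group: (r - mean) / (std + eps).

    `eps` is an assumption added here to avoid division by zero when
    every answer in the group receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four answers sampled for one prompt
# (1.0 = correct final answer, 0.0 = incorrect).
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
```

Answers that beat the group average receive a positive advantage and the rest a negative one, so no separate value model is needed to provide a baseline.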
Key Capabilities
- Enhanced Reasoning: Leverages the GRPO training method for improved performance on complex reasoning tasks, particularly in mathematical domains.
- Fine-tuned with TRL: Built on an unspecified base model and refined with the TRL library, which provides reinforcement-learning-based fine-tuning methods such as GRPO.
- Extended Context Window: Supports a substantial context length of 32768 tokens, allowing for the processing of longer and more intricate problem descriptions or dialogues.
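Because the prompt and the generated tokens share the same 32768-token window, a long problem statement reduces the room left for the answer. The helper below is a small sketch of that arithmetic; the function name is hypothetical and the token count is assumed to come from the model's tokenizer.

```python
CONTEXT_LENGTH = 32768  # context length stated in this model card


def max_new_tokens_budget(prompt_tokens, context_length=CONTEXT_LENGTH):
    """Return how many tokens can still be generated after the prompt.

    Hypothetical helper: `prompt_tokens` is the number of tokens the
    tokenizer produced for the prompt.
    """
    if prompt_tokens < 0:
        raise ValueError("prompt_tokens must be non-negative")
    # A prompt longer than the window leaves no room for generation.
    return max(context_length - prompt_tokens, 0)
```

For example, a 30000-token prompt leaves 2768 tokens for the model's reasoning and answer.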
Training Details
The model's training procedure is publicly viewable via Weights & Biases, providing transparency into its development. It was trained with specific versions of key frameworks:
- TRL: 0.16.0.dev0
- Transformers: 4.48.3
- PyTorch: 2.5.1
- Datasets: 4.0.0
- Tokenizers: 0.21.1
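To compare a local environment against the versions listed above, a check along the following lines may help. It is a sketch using only the standard library; the distribution names (`trl`, `transformers`, `torch`, `datasets`, `tokenizers`) are the usual PyPI names and are assumed here, and local build suffixes such as `+cu121` are ignored in the comparison.

```python
from importlib.metadata import PackageNotFoundError, version

# Versions listed in this model card, keyed by assumed PyPI distribution name.
EXPECTED = {
    "trl": "0.16.0.dev0",
    "transformers": "4.48.3",
    "torch": "2.5.1",
    "datasets": "4.0.0",
    "tokenizers": "0.21.1",
}


def compare_versions(expected):
    """Map each package to (wanted, installed, matches).

    `installed` is None when the package is not installed.
    """
    report = {}
    for name, wanted in expected.items():
        try:
            installed = version(name)
        except PackageNotFoundError:
            installed = None
        # Drop a local build suffix like "+cu121" before comparing.
        matches = installed is not None and installed.split("+")[0] == wanted
        report[name] = (wanted, installed, matches)
    return report


report = compare_versions(EXPECTED)
```

A mismatch does not necessarily mean the model will fail to load; the listed versions are the ones used for training.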
Good For
- Applications requiring strong mathematical problem-solving.
- Tasks that benefit from advanced logical deduction.
- Research into reinforcement learning-based fine-tuning for reasoning.
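For the use cases above, the model can in principle be loaded like any causal language model in Transformers. The sketch below assumes the repository exposes standard weights and a tokenizer; the prompt wording in `build_prompt` and the default `max_new_tokens` are hypothetical, since this card does not document a prompt format or recommended generation settings.

```python
MODEL_ID = "zhaohq/PureRL-1.5B-v7-s2-l1-maskoff"


def build_prompt(problem):
    """Wrap a problem statement in a plain instruction.

    Hypothetical format: the model card does not specify one.
    """
    return (
        "Solve the following problem step by step.\n\n"
        f"Problem: {problem.strip()}\n\nSolution:"
    )


def generate_answer(problem, max_new_tokens=512):
    """Load the model and generate a completion for one problem."""
    # Imported here so the helper above works without transformers installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto")

    inputs = tokenizer(build_prompt(problem), return_tensors="pt").to(model.device)
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)

    # Keep only the tokens generated after the prompt.
    new_tokens = output_ids[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)
```

Calling `generate_answer("What is the sum of the first 10 positive integers?")` downloads the weights on first use, so it needs network access and enough memory for a 1.5B-parameter model.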