zhaohq/PureRL-1.5B-v7-s2-l2-maskon
PureRL-1.5B-v7-s2-l2-maskon is a 1.5 billion parameter language model developed by zhaohq, fine-tuned using the TRL framework. This model was trained with GRPO, a method specifically designed to enhance mathematical reasoning capabilities. With a context length of 32768 tokens, it is optimized for tasks requiring advanced mathematical problem-solving and logical deduction.
Loading preview...
Overview
zhaohq/PureRL-1.5B-v7-s2-l2-maskon is a 1.5 billion parameter language model, fine-tuned using the TRL (Transformer Reinforcement Learning) framework. This model leverages the GRPO (Generative Reinforcement Learning with Policy Optimization) training method, which is detailed in the research paper "DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models". The training procedure utilized specific versions of TRL (0.16.0.dev0), Transformers (4.57.6), Pytorch (2.10.0), Datasets (4.8.5), and Tokenizers (0.22.2).
Key Capabilities
- Enhanced Mathematical Reasoning: Trained with GRPO, a method focused on improving mathematical problem-solving.
- Reinforcement Learning Fine-tuning: Utilizes the TRL library for advanced fine-tuning techniques.
- Large Context Window: Supports a context length of 32768 tokens, allowing for processing longer inputs and complex problems.
Good For
- Applications requiring strong mathematical reasoning.
- Research into reinforcement learning fine-tuning methods for language models.
- Tasks benefiting from a model with a substantial context window for detailed problem analysis.