zhaohq/PureRL-1.5B-v7-s2-l2-maskoff
The zhaohq/PureRL-1.5B-v7-s2-l2-maskoff model is a 1.5 billion parameter language model fine-tuned using Reinforcement Learning (RL) with the GRPO method. This model, developed by zhaohq, leverages techniques from the DeepSeekMath paper to enhance its reasoning capabilities. With a context length of 32768 tokens, it is optimized for generating coherent and contextually relevant text, particularly in response to complex prompts.
Loading preview...
Model Overview
The zhaohq/PureRL-1.5B-v7-s2-l2-maskoff is a 1.5 billion parameter language model developed by zhaohq. It has been fine-tuned using Reinforcement Learning (RL) through the TRL library, specifically employing the GRPO method.
Key Characteristics
- Reinforcement Learning Fine-tuning: The model's training incorporates GRPO (Generalized Reinforcement Learning with Policy Optimization), a method introduced in the research paper "DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models" (arXiv:2402.03300). This suggests an emphasis on improving reasoning and response quality.
- Parameter Count: With 1.5 billion parameters, it offers a balance between computational efficiency and performance.
- Context Length: The model supports a substantial context length of 32768 tokens, allowing it to process and generate longer, more detailed responses while maintaining context.
Training Details
The model was trained using TRL version 0.16.0.dev0, Transformers 4.48.3, Pytorch 2.5.1, Datasets 4.0.0, and Tokenizers 0.21.1. The training process can be visualized via Weights & Biases.
Use Cases
This model is suitable for text generation tasks where nuanced understanding and coherent, context-aware responses are required, potentially benefiting from its RL-enhanced reasoning capabilities.