The sravanthib/Qwen-2.5-7B-Simple-RL model is a 7.6 billion parameter language model fine-tuned from Qwen/Qwen2.5-Math-7B. It was trained with GRPO, the reinforcement learning method introduced in DeepSeekMath, making it particularly suited to mathematical reasoning tasks. The model supports a 131,072-token context length, enhancing its ability to process and generate longer, more complex responses.
Overview
This model, sravanthib/Qwen-2.5-7B-Simple-RL, is a 7.6 billion parameter language model built upon the Qwen/Qwen2.5-Math-7B base. It was fine-tuned with the TRL (Transformer Reinforcement Learning) framework, specifically using the GRPO (Group Relative Policy Optimization) method. GRPO was introduced in the DeepSeekMath research, which focuses on enhancing mathematical reasoning capabilities in large language models.
Key Capabilities
- Enhanced Mathematical Reasoning: Fine-tuned with GRPO, suggesting improved performance on complex mathematical problems and logical deduction.
- Large Context Window: Supports a substantial context length of 131,072 tokens, allowing for processing and generating extensive text sequences.
- Reinforcement Learning Fine-tuning: Leverages advanced RL techniques for potentially more aligned and coherent outputs.
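Like other Qwen2.5-based causal language models on the Hub, this model should be loadable through the standard `transformers` API. The snippet below is an untested sketch: it assumes the checkpoint ships the base model's tokenizer and chat template, and it requires downloading the ~7.6B-parameter weights (a GPU is strongly recommended).

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "sravanthib/Qwen-2.5-7B-Simple-RL"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",   # use the dtype stored in the checkpoint
    device_map="auto",    # place layers on available GPU(s)
)

# Build a prompt with the tokenizer's chat template
messages = [{"role": "user", "content": "Solve: if 3x + 7 = 22, what is x?"}]
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)

# Decode only the newly generated tokens
print(tokenizer.decode(
    outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True
))
```

Generation parameters (temperature, top-p, etc.) can be passed to `model.generate` as usual; for math tasks, greedy or low-temperature decoding is a common starting point.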
Training Details
The model's training procedure involved GRPO, a method detailed in the paper "DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models" (arXiv:2402.03300). The training utilized TRL version 0.16.0.dev0, with Transformers 4.49.0 and PyTorch 2.5.1.
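The core idea of GRPO, as described in the DeepSeekMath paper, is to drop the learned value function of PPO and instead compute advantages by normalizing each sampled completion's reward against the mean and standard deviation of its own group of samples. The sketch below illustrates only that normalization step in plain Python; the actual TRL training loop adds clipping, a KL penalty, and per-token handling.

```python
def grpo_advantages(rewards, eps=1e-8):
    """Group-relative advantages: normalize each reward by the mean and
    (population) std of its own group of sampled completions, so no
    separate critic/value model is needed."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]

# Four completions sampled for the same prompt, scored 1.0 if the final
# answer is correct and 0.0 otherwise (a typical rule-based math reward).
advs = grpo_advantages([1.0, 0.0, 1.0, 0.0])
print(advs)  # correct samples get positive advantage, incorrect get negative
```

Because the normalization is within-group, the advantages always sum to (approximately) zero: correct completions are pushed up exactly as much as incorrect ones are pushed down, relative to the group's average reward.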