Overview
This model, hector-gr/RLCR-v4-ks-highcov-batch-cold-math, is a 7.6-billion-parameter language model fine-tuned from the Qwen/Qwen2.5-7B base model. It was developed by hector-gr with the TRL framework using GRPO (Group Relative Policy Optimization), a reinforcement-learning method introduced in the DeepSeekMath paper to push the limits of mathematical reasoning in open language models.
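To illustrate the group-relative idea behind GRPO: rewards for a group of completions sampled from the same prompt are normalized against the group's mean and standard deviation to form advantages, removing the need for a separate value model. The sketch below is based on the DeepSeekMath description, not on code from this repository:

```python
from statistics import mean, stdev

def group_relative_advantages(rewards: list[float], eps: float = 1e-4) -> list[float]:
    """GRPO-style advantages: normalize each completion's reward against
    the mean and standard deviation of its group (the set of completions
    sampled for one prompt)."""
    mu = mean(rewards)
    sigma = stdev(rewards) if len(rewards) > 1 else 0.0
    return [(r - mu) / (sigma + eps) for r in rewards]

# Example: four sampled answers to one math prompt, scored 1.0 if correct.
adv = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Correct completions end up with positive advantages and incorrect ones with negative advantages, which is what steers the policy update toward better reasoning traces.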
Key Capabilities
- Enhanced Mathematical Reasoning: Leverages the GRPO method for improved performance on mathematical tasks.
- Large Context Window: Supports a context length of 32768 tokens, allowing it to process long, complex inputs such as multi-step derivations.
- Qwen2.5 Architecture: Benefits from the robust architecture of the Qwen2.5 series.
Training Details
The model was trained with the TRL library (version 0.16.0.dev0) on PyTorch 2.5.1. The use of GRPO indicates that training focused on optimizing the model's ability to handle intricate mathematical problems and logical reasoning.
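A setup like the one described above can be sketched with TRL's `GRPOTrainer`. This is a hypothetical outline, not the card's actual recipe: the reward function and dataset are illustrative placeholders, and the real training almost certainly used a task-specific reward and math dataset.

```python
# Sketch of a TRL GRPO fine-tuning setup (illustrative only).

def boxed_answer_reward(completions, **kwargs):
    """Toy reward: 1.0 when the completion contains a \\boxed{...} answer.
    A real math reward would check the answer against a reference."""
    return [1.0 if "\\boxed{" in c else 0.0 for c in completions]

if __name__ == "__main__":
    # Heavy dependencies are imported only when actually training.
    from datasets import load_dataset
    from trl import GRPOConfig, GRPOTrainer

    trainer = GRPOTrainer(
        model="Qwen/Qwen2.5-7B",  # base model named in this card
        reward_funcs=boxed_answer_reward,
        args=GRPOConfig(output_dir="grpo-math", max_completion_length=1024),
        # Placeholder dataset with a "prompt" column; swap in a math dataset.
        train_dataset=load_dataset("trl-lib/tldr", split="train"),
    )
    trainer.train()
```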
When to Use
This model is particularly well-suited to applications that demand strong mathematical problem-solving and step-by-step reasoning, especially workloads where the DeepSeekMath-style reinforcement-learning approach to mathematical reasoning pays off.
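For such use cases, the model can be loaded with the standard `transformers` API. This is a sketch assuming the checkpoint is publicly available on the Hub; the `solve` helper and the example prompt are illustrative, not part of the card.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "hector-gr/RLCR-v4-ks-highcov-batch-cold-math"

def solve(prompt: str, max_new_tokens: int = 512) -> str:
    """Generate a completion for a math prompt with the fine-tuned model."""
    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID, torch_dtype="auto", device_map="auto"
    )
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output = model.generate(**inputs, max_new_tokens=max_new_tokens)
    # Decode only the newly generated tokens, not the echoed prompt.
    return tokenizer.decode(
        output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
    )

if __name__ == "__main__":
    print(solve("Solve for x: 3x + 7 = 22. Show your reasoning."))
```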