zhaohq/PureRL-1.5B-v5-06-mc2
The zhaohq/PureRL-1.5B-v5-06-mc2 model is a 1.5 billion parameter language model, fine-tuned from Qwen/Qwen2.5-Math-1.5B, with a 32K context length. It was trained using the GRPO method, which is designed to enhance mathematical reasoning capabilities. This model is optimized for tasks requiring advanced mathematical problem-solving and logical deduction.
Loading preview...
Model Overview
zhaohq/PureRL-1.5B-v5-06-mc2 is a 1.5 billion parameter language model, building upon the Qwen/Qwen2.5-Math-1.5B architecture. It features a substantial context window of 32,768 tokens, making it suitable for processing longer inputs and complex problem statements. The model's development utilized the TRL framework and incorporated the GRPO (Gradient-based Reward Policy Optimization) method, as detailed in the research paper "DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models". This training approach specifically targets the enhancement of mathematical reasoning abilities.
Key Capabilities
- Enhanced Mathematical Reasoning: Fine-tuned with GRPO, a method designed to improve performance on mathematical tasks.
- Long Context Understanding: Supports a 32K token context length, allowing for the processing of extensive problem descriptions or multi-step reasoning chains.
- Qwen2.5-Math Base: Leverages the foundational capabilities of the Qwen2.5-Math-1.5B model, which is inherently strong in mathematical domains.
Good For
- Mathematical Problem Solving: Ideal for applications requiring accurate and robust mathematical reasoning.
- Complex Logical Deduction: Suitable for tasks that benefit from processing detailed information and deriving logical conclusions.
- Research and Development: Provides a base for further experimentation and fine-tuning in mathematical AI applications, particularly those exploring reinforcement learning techniques like GRPO.