zhaohq/PureRL-1.5B-v12C-lam010
zhaohq/PureRL-1.5B-v12C-lam010 is a 1.5-billion-parameter language model fine-tuned from Qwen/Qwen2.5-Math-1.5B, with a 32K-token context length. It was trained with reinforcement learning (RL) using the GRPO method, with the aim of improving mathematical reasoning. It targets tasks that require multi-step mathematical problem-solving and logical deduction.
Model Overview
zhaohq/PureRL-1.5B-v12C-lam010 is a 1.5-billion-parameter language model fine-tuned from the Qwen/Qwen2.5-Math-1.5B base model. It supports a context length of 32,768 tokens, which leaves room for long problem statements and extended step-by-step solutions.
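A minimal sketch of loading the model with the Transformers library and checking that a request fits in the context window. It assumes the repository exposes a standard causal-LM checkpoint and uses the 32,768-token figure stated above. The helper names `fits_in_context` and `generate`, the example prompt, and the 512-token default budget are illustrative choices, not part of the model card.

```python
MODEL_ID = "zhaohq/PureRL-1.5B-v12C-lam010"
CONTEXT_LEN = 32_768  # context length stated in this overview


def fits_in_context(prompt_tokens: int, max_new_tokens: int,
                    context_len: int = CONTEXT_LEN) -> bool:
    """Return True if the prompt plus the generation budget fits in the window."""
    return prompt_tokens + max_new_tokens <= context_len


def generate(prompt: str, max_new_tokens: int = 512) -> str:
    """Load the model and return a completion (downloads weights on first call)."""
    # Imported lazily so fits_in_context works without transformers installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto")

    inputs = tokenizer(prompt, return_tensors="pt")
    prompt_tokens = inputs["input_ids"].shape[1]
    if not fits_in_context(prompt_tokens, max_new_tokens):
        raise ValueError("prompt plus max_new_tokens exceeds the context window")

    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    # Decode only the newly generated tokens, not the echoed prompt.
    return tokenizer.decode(output_ids[0][prompt_tokens:], skip_special_tokens=True)
```

Calling `generate("Solve for x: 2x + 3 = 11. Show your reasoning.")` would download the weights and run on CPU by default; moving the model and inputs to a GPU is left out to keep the sketch short.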
Key Differentiator: GRPO Training
What sets this model apart is its training methodology. It was fine-tuned with reinforcement learning (RL) using GRPO (Group Relative Policy Optimization), the method introduced in the paper "DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models." GRPO samples a group of responses for each prompt and scores each response relative to the rest of its group, which removes the need for a separate value (critic) model. The goal of this training is to improve the model's mathematical reasoning and problem-solving.
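The core idea of GRPO (Group Relative Policy Optimization) can be illustrated with its advantage computation: several responses are sampled for the same prompt, each gets a scalar reward, and each reward is normalized by the mean and standard deviation of its group. The sketch below covers only that step, under stated assumptions: the use of the population standard deviation and the `eps` stabilizer are illustrative choices, and the clipped policy-ratio objective and KL penalty from the paper are omitted. The reward function actually used to train this model is not stated in the source, so the rewards here are made-up 0/1 correctness scores.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards: list[float], eps: float = 1e-4) -> list[float]:
    """Normalize each reward against the other responses to the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    # eps avoids division by zero when every response gets the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four sampled answers to one problem (1 = correct).
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
```

Responses that beat their group average get a positive advantage and the rest get a negative one. When every response in a group receives the same reward, all advantages are zero and that prompt contributes no learning signal.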
Use Cases
- Mathematical Reasoning: Intended for applications that require mathematical problem-solving, logical deduction, and numerical reasoning.
- Research and Development: Useful for researchers exploring the impact of RL-based fine-tuning methods like GRPO on specialized reasoning tasks.
Technical Details
The model was trained with the TRL library (version 0.16.0.dev0), Transformers 4.48.3, and PyTorch 2.5.1. The training targets tasks where precise mathematical reasoning is critical.
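When trying to reproduce or extend the training setup, it can help to compare a local environment against the versions listed above. The following is a small sketch using only the standard library. The helper names are hypothetical, and the distribution names `trl`, `transformers`, and `torch` are assumed to be the PyPI package names for the three libraries.

```python
from importlib.metadata import PackageNotFoundError, version

# Versions reported in the Technical Details section.
EXPECTED = {"trl": "0.16.0.dev0", "transformers": "4.48.3", "torch": "2.5.1"}


def installed_versions(packages):
    """Map each package name to its installed version, or None if missing."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


def version_mismatches(installed, expected=EXPECTED):
    """Return {name: (installed, expected)} for every package that differs."""
    mismatches = {}
    for name, want in expected.items():
        got = installed.get(name)
        # Ignore local build tags such as "+cu121" on PyTorch wheels.
        base = got.split("+")[0] if got else None
        if base != want:
            mismatches[name] = (got, want)
    return mismatches
```

For example, `version_mismatches(installed_versions(EXPECTED))` returns an empty dictionary when the local environment matches the listed versions. Matching versions is not required for inference; it mainly matters when reproducing training runs.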