zhaohq/PureRL-1.5B-v5-06-uppl
The zhaohq/PureRL-1.5B-v5-06-uppl model is a 1.5 billion parameter language model, fine-tuned from Qwen/Qwen2.5-Math-1.5B with a 32K context length. It was trained using Reinforcement Learning (RL) via the TRL framework, specifically employing the GRPO method. This model is optimized for enhanced reasoning capabilities, particularly in mathematical contexts, building upon its base model's strengths.
Model Overview
zhaohq/PureRL-1.5B-v5-06-uppl is a 1.5 billion parameter language model, fine-tuned from the Qwen/Qwen2.5-Math-1.5B base model. It supports a context length of 32,768 tokens, which makes it suitable for tasks with long inputs or long multi-step outputs.
Key Training Details
This model was trained using the TRL (Transformer Reinforcement Learning) framework. Its training procedure applies GRPO (Group Relative Policy Optimization), a method introduced in the research paper "DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models" (arXiv:2402.03300). GRPO samples a group of completions for each prompt and scores each one relative to the rest of its group, so it needs no separate value model. Combined with the math-focused base model, this choice points to a focus on improving mathematical reasoning and problem solving through reinforcement learning.
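To make the group-relative idea concrete, here is a minimal sketch of the advantage computation GRPO uses under outcome supervision: each completion's reward is normalized by the mean and standard deviation of its group. The function name, the epsilon value, and the use of the population standard deviation are assumptions made for illustration; this is not the training code used for this model.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-4):
    """Normalize the rewards of one group of completions sampled for one prompt.

    Illustrative helper (hypothetical name). `eps` guards against division
    by zero when every completion in the group receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population standard deviation
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled answers to one math problem, rewarded 1.0 if correct, else 0.0.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Correct answers end up with positive advantages and incorrect ones with negative advantages. When all answers in a group receive the same reward, every advantage is zero and that group contributes no learning signal.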
Potential Use Cases
Given its foundation in Qwen2.5-Math-1.5B and subsequent fine-tuning with GRPO, this model is likely well-suited for:
- Mathematical reasoning tasks: Solving complex math problems, generating mathematical explanations, or assisting in scientific computations.
- General question answering: The reasoning-focused fine-tuning may yield more coherent and logically sound responses, although the model is primarily tuned for mathematics.
- Applications requiring robust logical inference: Where the ability to follow multi-step reasoning is crucial.
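For trying the model on a math problem, the following is a minimal sketch using the Hugging Face transformers library. It assumes the checkpoint is available on the Hugging Face Hub under the id shown and loads as a standard causal language model. The prompt wording in build_prompt is a hypothetical choice, since this card does not specify a prompt format, and solve is an illustrative helper name.

```python
MODEL_ID = "zhaohq/PureRL-1.5B-v5-06-uppl"


def build_prompt(problem: str) -> str:
    # Hypothetical prompt format; the model card does not prescribe one.
    return (
        f"{problem}\n"
        "Please reason step by step, and put your final answer within \\boxed{}."
    )


def solve(problem: str, max_new_tokens: int = 512) -> str:
    # Imported here so the heavy dependency is only needed when generating.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID)

    inputs = tokenizer(build_prompt(problem), return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)

    # Keep only the newly generated tokens, dropping the echoed prompt.
    new_tokens = output_ids[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)
```

Calling solve("What is 12 * 13?") downloads the weights on first use and returns the generated text. Raise max_new_tokens for problems that need longer multi-step solutions.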