zhaohq/PureRL-1.5B-v13A-lam002
zhaohq/PureRL-1.5B-v13A-lam002 is a 1.5 billion parameter language model fine-tuned from Qwen/Qwen2.5-Math-1.5B, with a 32768 token context length. It was trained with GRPO, the reinforcement learning method introduced in the DeepSeekMath paper, to strengthen mathematical reasoning. The model targets tasks that require multi-step mathematical problem solving and logical deduction.
Model Overview
zhaohq/PureRL-1.5B-v13A-lam002 is a 1.5 billion parameter language model fine-tuned from Qwen/Qwen2.5-Math-1.5B. It was trained with the TRL (Transformer Reinforcement Learning) library using GRPO (Group Relative Policy Optimization), the reinforcement learning algorithm introduced in the DeepSeekMath paper to improve mathematical reasoning in large language models.
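The core idea of GRPO is that each sampled completion is scored relative to the other completions drawn for the same prompt, so no separate value model is needed. The sketch below illustrates only that group-relative normalization step. It is not this model's training code: the function name, the `eps` value, and the choice of population standard deviation are assumptions made for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-4):
    """Normalize the rewards of one prompt's completions against their own group.

    `rewards` holds one scalar reward per sampled completion (non-empty).
    `eps` is a hypothetical stabilizer that avoids division by zero when
    every completion in the group receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std over the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four completions for one math problem, rewarded 1.0 if the answer is correct.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
# Correct completions get a positive advantage, incorrect ones a negative one.
```

When every completion in a group receives the same reward, all advantages are zero, so that prompt contributes no learning signal for the step.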
Key Characteristics
- Base Model: Fine-tuned from Qwen/Qwen2.5-Math-1.5B.
- Training Method: GRPO (Group Relative Policy Optimization), the reinforcement learning algorithm introduced in the DeepSeekMath paper.
- Framework: Trained with Hugging Face's TRL library.
- Context Length: Supports a context window of 32768 tokens.
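The characteristics above suggest a straightforward way to try the model with the Hugging Face `transformers` library. The following is a minimal sketch, assuming the checkpoint loads as a standard causal language model and accepts a plain-text prompt; the prompt format this model actually expects is not stated here. The helpers `generation_budget` and `solve` are hypothetical names introduced for illustration.

```python
MODEL_ID = "zhaohq/PureRL-1.5B-v13A-lam002"
CONTEXT_LENGTH = 32768  # context window listed in the characteristics above


def generation_budget(prompt_tokens, requested_new_tokens,
                      context_length=CONTEXT_LENGTH):
    """Clamp the number of new tokens so prompt plus output fits the window."""
    remaining = context_length - prompt_tokens
    if remaining <= 0:
        raise ValueError("prompt already fills the context window")
    return min(requested_new_tokens, remaining)


def solve(problem, requested_new_tokens=1024):
    """Generate a solution for `problem` (assumption: plain-text prompting)."""
    # Imported lazily so generation_budget works without transformers installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID)

    inputs = tokenizer(problem, return_tensors="pt")
    prompt_tokens = inputs["input_ids"].shape[1]
    output_ids = model.generate(
        **inputs,
        max_new_tokens=generation_budget(prompt_tokens, requested_new_tokens),
    )
    # Decode only the newly generated tokens, not the echoed prompt.
    return tokenizer.decode(output_ids[0][prompt_tokens:],
                            skip_special_tokens=True)
```

Calling `solve("What is the sum of the first 100 positive integers?")` downloads the weights on first use. The budget helper only guards against exceeding the stated context length and does not affect output quality.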
Potential Use Cases
- Mathematical Problem Solving: Intended for applications that require multi-step mathematical reasoning and computation.
- Logical Deduction: Suitable for tasks that benefit from enhanced logical processing capabilities.
- Research and Development: Can serve as a base for further experimentation with reinforcement learning techniques in language models, particularly for specialized domains.