zhaohq/PureRL-1.5B-v13D-lam025
The zhaohq/PureRL-1.5B-v13D-lam025 model is a 1.5 billion parameter language model, fine-tuned from Qwen/Qwen2.5-Math-1.5B using Reinforcement Learning (RL) with the TRL framework. It leverages the GRPO method, originally introduced in DeepSeekMath, to enhance its capabilities. This model is designed for general text generation tasks, building upon a mathematical reasoning base, and supports a context length of 32768 tokens.
Loading preview...
Overview
This model, zhaohq/PureRL-1.5B-v13D-lam025, is a 1.5 billion parameter language model derived from the Qwen/Qwen2.5-Math-1.5B base model. It has been fine-tuned using Reinforcement Learning (RL) via the TRL library, specifically implementing the GRPO method. The GRPO method was first introduced in the paper "DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models", suggesting an emphasis on robust reasoning capabilities.
Key Capabilities
- Reinforcement Learning Fine-tuning: Utilizes the GRPO method for training, which is known for enhancing mathematical reasoning in large language models.
- Base Model: Built upon Qwen2.5-Math-1.5B, indicating a foundation in mathematical and logical understanding.
- Context Length: Supports a substantial context window of 32768 tokens, allowing for processing longer inputs and generating more coherent, extended responses.
Good For
- General Text Generation: Capable of generating human-like text for various prompts, as demonstrated by the quick start example.
- Reasoning-based Tasks: Given its lineage and training method, it may perform well in tasks requiring logical inference or structured problem-solving.
- Exploration of RL-tuned Models: Developers interested in models fine-tuned with advanced RL techniques like GRPO can use this as a reference or starting point.