zhaohq/PureRL-1.5B-v7-stage1-A-fewshot
The zhaohq/PureRL-1.5B-v7-stage1-A-fewshot model is a 1.5 billion parameter language model fine-tuned from Qwen/Qwen2.5-Math-1.5B. Developed by zhaohq, it was trained with the TRL framework using GRPO, a reinforcement learning method introduced in the DeepSeekMath paper to improve mathematical reasoning. The model targets mathematical problem solving.
Model Overview
This model, zhaohq/PureRL-1.5B-v7-stage1-A-fewshot, is a 1.5 billion parameter language model derived from the Qwen/Qwen2.5-Math-1.5B base. It has been fine-tuned using the TRL (Transformer Reinforcement Learning) framework.
Key Training Details
The model was trained with GRPO (Group Relative Policy Optimization). This method was introduced in the research paper "DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models," which indicates a focus on improving mathematical reasoning abilities.
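To make the core idea of GRPO concrete, the sketch below shows how rewards for a group of completions sampled from the same prompt can be turned into group-relative advantages by subtracting the group mean and dividing by the group standard deviation. This is an illustration only, not this model's training code: the function name, the example rewards, the use of the population standard deviation, and the `eps` value are all assumptions made for the example.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize the rewards of completions sampled for one prompt.

    Each advantage is (reward - group mean) / (group std + eps).
    Using the population std and this eps value is an assumption of
    this sketch, not a detail taken from the model card.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four completions of the same math problem
# (1.0 = correct final answer, 0.0 = incorrect).
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
print(advantages)
```

Completions that score above the group average receive positive advantages and those below receive negative ones, so no separate value model is needed to provide a baseline.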
Intended Use
Given its math-focused base model and GRPO training, this model is best suited to tasks that require mathematical reasoning and problem solving. Developers can integrate it through the Hugging Face Transformers pipeline for text generation.
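A minimal sketch of the pipeline integration might look like the following. The prompt layout in `build_prompt`, the helper names, and the `max_new_tokens` default are assumptions for illustration; the model card does not specify a prompt format, so adjust it to whatever the model expects.

```python
MODEL_ID = "zhaohq/PureRL-1.5B-v7-stage1-A-fewshot"


def build_prompt(question: str) -> str:
    # Hypothetical prompt layout; the model card does not prescribe one.
    return f"Question: {question}\nAnswer:"


def solve(question: str, max_new_tokens: int = 256) -> str:
    # Imported lazily so build_prompt can be used without transformers installed.
    from transformers import pipeline

    # Downloads the model weights on first use.
    generator = pipeline("text-generation", model=MODEL_ID)
    outputs = generator(
        build_prompt(question),
        max_new_tokens=max_new_tokens,
        return_full_text=False,  # return only the newly generated text
    )
    return outputs[0]["generated_text"]
```

Calling `solve("What is 12 * 17?")` would load the model and return its generated continuation as a string.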
Framework Versions
Key frameworks used during its development include:
- TRL: 0.16.0.dev0
- Transformers: 4.57.6
- PyTorch: 2.10.0
- Datasets: 4.8.5
- Tokenizers: 0.22.2