zhaohq/PureRL-1.5B-v7-s2-l2-kl-w0-b1
The zhaohq/PureRL-1.5B-v7-s2-l2-kl-w0-b1 model is a 1.5 billion parameter language model, fine-tuned from zhaohq/PureRL-1.5B-v7-stage1-reasoning. Developed by zhaohq, this model was trained using the TRL framework and incorporates the GRPO method, which is designed to enhance mathematical reasoning capabilities. It is optimized for tasks requiring advanced reasoning, particularly in mathematical contexts, building upon its base model's reasoning foundation.
Loading preview...
Overview
The zhaohq/PureRL-1.5B-v7-s2-l2-kl-w0-b1 is a 1.5 billion parameter language model developed by zhaohq. It is a fine-tuned iteration of the zhaohq/PureRL-1.5B-v7-stage1-reasoning model, building upon its initial reasoning capabilities.
Key Training Details
This model was trained using the TRL (Transformer Reinforcement Learning) framework, specifically version 0.16.0.dev0. A significant aspect of its training procedure is the implementation of GRPO (Generalized Reinforcement Learning with Policy Optimization), a method introduced in the paper "DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models". This suggests a strong focus on improving mathematical and general reasoning performance.
Intended Use Cases
Given its fine-tuning with the GRPO method, this model is particularly suited for:
- Mathematical reasoning tasks: Leveraging the techniques from DeepSeekMath, it aims to excel in complex mathematical problem-solving.
- Advanced reasoning applications: Building on its stage1 reasoning base, it can be applied to tasks requiring logical deduction and problem-solving.
Developers can quickly get started using the provided transformers pipeline for text generation, as demonstrated in the quick start guide.