zhaohq/PureRL-1.5B-v7-s2-l2-maskon-afew
The zhaohq/PureRL-1.5B-v7-s2-l2-maskon-afew model is a 1.5 billion parameter language model, fine-tuned from zhaohq/PureRL-1.5B-v7-stage1-A-fewshot. It was trained using the TRL framework and incorporates GRPO, a method known for enhancing mathematical reasoning in large language models. This model is designed for general text generation tasks, leveraging its 32768-token context length for coherent and extended outputs.
Loading preview...
Model Overview
zhaohq/PureRL-1.5B-v7-s2-l2-maskon-afew is a 1.5 billion parameter language model, building upon the zhaohq/PureRL-1.5B-v7-stage1-A-fewshot base. It was developed using the TRL (Transformer Reinforcement Learning) framework, indicating a focus on reinforcement learning from human feedback or similar optimization techniques.
Key Training Details
A notable aspect of this model's training is the application of GRPO (Generalized Reinforcement Learning with Policy Optimization). This method, introduced in the paper "DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models," suggests that this model may possess enhanced capabilities in areas requiring logical and mathematical reasoning, despite its general-purpose fine-tuning.
Usage and Capabilities
With a substantial context length of 32768 tokens, the model is well-suited for generating extended and contextually rich text. Its fine-tuning process aims to improve its ability to respond to diverse prompts, as demonstrated by the quick start example for open-ended questions. The model is designed for text generation tasks, offering a balance between parameter size and performance for various applications.