zhaohq/PureRL-1.5B-v14L-stage1-bce-binary-k8
The zhaohq/PureRL-1.5B-v14L-stage1-bce-binary-k8 is a 1.5 billion parameter language model fine-tuned using the TRL framework. It was trained with GRPO, a method introduced in the DeepSeekMath paper, which focuses on mathematical reasoning. This model is designed for general text generation tasks, leveraging its specialized training approach to potentially enhance reasoning capabilities.
Loading preview...
Model Overview
The zhaohq/PureRL-1.5B-v14L-stage1-bce-binary-k8 is a 1.5 billion parameter language model. It has been fine-tuned using the TRL (Transformer Reinforcement Learning) framework, a library for training transformer language models with reinforcement learning.
Key Training Details
A notable aspect of this model's development is its training procedure, which incorporated GRPO (Generalized Reinforcement Learning with Policy Optimization). This method was originally introduced in the research paper "DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models." While the base model is not specified, the application of GRPO suggests an emphasis on improving reasoning capabilities, potentially extending beyond just mathematical contexts.
Quick Start
Users can quickly get started with this model using the transformers library, as demonstrated by the provided Python pipeline example for text generation. The model supports a context length of 32768 tokens.
Framework Versions
The model was trained using specific versions of key frameworks:
- TRL: 0.16.0.dev0
- Transformers: 4.48.3
- Pytorch: 2.5.1
- Datasets: 4.0.0
- Tokenizers: 0.21.1
Potential Use Cases
Given its training methodology, this model could be particularly suitable for:
- General text generation tasks where enhanced reasoning might be beneficial.
- Applications requiring a compact model (1.5B parameters) with a focus on structured or logical responses.
- Exploration of models fine-tuned with advanced reinforcement learning techniques like GRPO.