zhaohq/PureRL-1.5B-v7-s2-corr-maskon-afew
zhaohq/PureRL-1.5B-v7-s2-corr-maskon-afew is a 1.5 billion parameter language model fine-tuned by zhaohq. This model is a fine-tuned version of zhaohq/PureRL-1.5B-v7-stage1-A-fewshot, trained using the TRL framework. It leverages the GRPO method, which is designed to enhance mathematical reasoning capabilities in open language models. The model is suitable for tasks requiring improved reasoning, particularly in mathematical contexts, building upon its base model's few-shot learning abilities.
Loading preview...
Model Overview
This model, zhaohq/PureRL-1.5B-v7-s2-corr-maskon-afew, is a 1.5 billion parameter language model developed by zhaohq. It is a fine-tuned iteration of the zhaohq/PureRL-1.5B-v7-stage1-A-fewshot base model, specifically trained using the TRL (Transformer Reinforcement Learning) framework.
Key Training Details
- Fine-tuning Method: The model was trained utilizing GRPO (Gradient-based Reward Policy Optimization), a method introduced in the paper "DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models".
- Frameworks Used: Training involved TRL (version 0.16.0.dev0), Transformers (version 4.48.3), PyTorch (version 2.5.1+cu124), Datasets (version 4.0.0), and Tokenizers (version 0.21.1).
Potential Use Cases
Given its fine-tuning with the GRPO method, this model is likely optimized for:
- Mathematical Reasoning: Tasks that require advanced mathematical problem-solving and logical deduction.
- Enhanced Reasoning: General reasoning tasks where the GRPO method's benefits can be applied beyond pure mathematics.
Developers can quickly integrate this model using the Hugging Face pipeline for text generation, as demonstrated in the quick start example.