mimoidochi/OpenRS-GRPO
mimoidochi/OpenRS-GRPO is a 1.5 billion parameter language model fine-tuned from deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B, with a 32768-token context length. It was trained using GRPO (Group Relative Policy Optimization), a method originally introduced for mathematical reasoning tasks. The model was fine-tuned on the knoveleng/open-rs dataset, making it suitable for conversational AI and question-answering applications.
Model Overview
mimoidochi/OpenRS-GRPO builds on the deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B architecture and was fine-tuned on the knoveleng/open-rs dataset using the TRL library.
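A minimal loading sketch using the standard Hugging Face transformers API; the dtype and device settings are illustrative assumptions, not documented requirements:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "mimoidochi/OpenRS-GRPO"

# Load the tokenizer and model weights from the Hugging Face Hub.
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",  # assumption: use the dtype stored in the checkpoint
    device_map="auto",   # assumption: requires the accelerate package
)
```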
Key Training Methodology
A distinguishing feature of this model is its training with GRPO (Group Relative Policy Optimization). This method, introduced in the paper "DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models" (arXiv:2402.03300), was designed to improve reasoning capabilities. The training process used TRL 0.14.0, Transformers 4.49.0, PyTorch 2.5.1, Datasets 4.5.0, and Tokenizers 0.21.4.
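The actual training script is not reproduced on this card. The sketch below shows the general shape of GRPO fine-tuning with TRL's GRPOTrainer (available as of TRL 0.14.0); the reward function, hyperparameters, and the assumption that the dataset exposes a "prompt" column are illustrative placeholders, not this model's configuration:

```python
from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer

# Dataset named on this card; GRPOTrainer expects a "prompt" column,
# so a real script may first need to map the dataset's fields.
dataset = load_dataset("knoveleng/open-rs", split="train")

# Placeholder reward: GRPO samples a group of completions per prompt and
# optimizes the policy against their relative rewards. A real setup would
# score answer correctness rather than completion length.
def reward_fn(completions, **kwargs):
    return [-abs(len(completion) - 200) for completion in completions]

training_args = GRPOConfig(output_dir="OpenRS-GRPO", logging_steps=10)
trainer = GRPOTrainer(
    model="deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B",
    reward_funcs=reward_fn,
    args=training_args,
    train_dataset=dataset,
)
trainer.train()
```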
Use Cases
Given its fine-tuning on the open-rs dataset, OpenRS-GRPO is particularly well-suited for the following (see the usage sketch after this list):
- Conversational AI: Generating coherent and contextually relevant responses in dialogue systems.
- Question Answering: Providing detailed answers to user queries.
- General Text Generation: Creating human-like text based on prompts, leveraging its reasoning-oriented training.
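A minimal question-answering sketch, assuming the chat template inherited from the DeepSeek-R1 distilled base model; the prompt and sampling parameters are illustrative:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "mimoidochi/OpenRS-GRPO"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

# Format a single-turn question with the model's chat template.
messages = [{"role": "user", "content": "What is the sum of the first 100 positive integers?"}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

# Sampling settings are illustrative; reasoning models are often run with a
# moderate temperature rather than greedy decoding.
output_ids = model.generate(input_ids, max_new_tokens=1024, do_sample=True, temperature=0.6)
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```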