kangdawei/DAPO-No-DS-7B
The DAPO-No-DS-7B model by kangdawei is a 7.6-billion-parameter language model fine-tuned from deepseek-ai/DeepSeek-R1-Distill-Qwen-7B. It was trained with the DAPO (Decoupled Clip and Dynamic sAmpling Policy Optimization) reinforcement learning method on the knoveleng/open-rs dataset and offers a 131,072-token context length. The model targets tasks that benefit from reinforcement learning at scale and was trained with the TRL framework.
Model Overview
The DAPO-No-DS-7B is a 7.6-billion-parameter language model developed by kangdawei. It is a fine-tuned variant of the deepseek-ai/DeepSeek-R1-Distill-Qwen-7B base model, trained with DAPO (Decoupled Clip and Dynamic sAmpling Policy Optimization), the reinforcement learning method introduced in the paper "DAPO: An Open-Source LLM Reinforcement Learning System at Scale" (arXiv:2503.14476), which scales RL training to enhance model performance.
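The decoupled clipping that gives DAPO part of its name can be sketched in a few lines of NumPy. This follows the "Clip-Higher" objective described in the paper; the default ε values below (0.2 low, 0.28 high) are the paper's defaults, not necessarily this checkpoint's training configuration.

```python
import numpy as np

def dapo_clipped_objective(ratios, advantages, eps_low=0.2, eps_high=0.28):
    """PPO-style surrogate with decoupled ("Clip-Higher") clip ranges.

    ratios:     pi_theta(token) / pi_old(token) for each token
    advantages: group-relative advantage estimate for each token

    Decoupling eps_high > eps_low lets low-probability tokens gain
    probability mass faster, which the DAPO paper uses to counteract
    entropy collapse during RL training.
    """
    ratios = np.asarray(ratios, dtype=float)
    advantages = np.asarray(advantages, dtype=float)
    unclipped = ratios * advantages
    clipped = np.clip(ratios, 1.0 - eps_low, 1.0 + eps_high) * advantages
    # Token-level mean: every token in the batch weighs equally.
    return float(np.mean(np.minimum(unclipped, clipped)))
```

For example, with a ratio of 1.5 and advantage +1, the clipped branch caps the contribution at 1.28 rather than the symmetric PPO cap of 1.2.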
Key Characteristics
- Base Model: Fine-tuned from deepseek-ai/DeepSeek-R1-Distill-Qwen-7B.
- Training Method: Utilizes the DAPO method for reinforcement learning at scale.
- Dataset: Trained on the knoveleng/open-rs dataset.
- Framework: Developed using the TRL (Transformer Reinforcement Learning) library.
- Context Length: Supports a substantial context window of 131,072 tokens.
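Besides decoupled clipping, the DAPO paper also replaces sample-level loss averaging with token-level averaging. A minimal NumPy sketch of the difference, illustrative of the paper's recipe rather than this repo's exact training code:

```python
import numpy as np

def sample_level_mean(token_losses):
    """GRPO-style aggregation: average tokens within each response,
    then average across responses. Tokens in long responses are
    effectively down-weighted."""
    return float(np.mean([np.mean(np.asarray(seq, dtype=float))
                          for seq in token_losses]))

def token_level_mean(token_losses):
    """DAPO-style aggregation: average over all tokens in the batch,
    so every token contributes equally regardless of response length."""
    flat = np.concatenate([np.asarray(seq, dtype=float)
                           for seq in token_losses])
    return float(np.mean(flat))
```

With a batch of two responses `[[1.0, 1.0], [4.0]]`, sample-level averaging gives 2.5 while token-level averaging gives 2.0, since the two tokens of the first response each count once.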
Intended Use
This model is suitable for applications requiring a large language model that has benefited from advanced reinforcement learning techniques. Its training on the knoveleng/open-rs dataset suggests potential strengths in areas related to the dataset's content, while the DAPO method aims to improve alignment and response quality through scaled reinforcement learning.
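The DAPO paper's dynamic sampling component (the card does not state whether it was enabled for this particular run) drops prompt groups whose sampled rollouts all receive the same reward, since their group-normalized advantages vanish and contribute no gradient. A sketch, with `group_advantages` following the GRPO-style within-group standardization that DAPO builds on:

```python
import numpy as np

def has_gradient_signal(rewards):
    """Keep a prompt group only if its rollouts disagree: groups that
    are all-correct or all-wrong yield zero group-relative advantage."""
    rewards = np.asarray(rewards, dtype=float)
    return bool(rewards.std() > 0.0)

def group_advantages(rewards, eps=1e-8):
    """Group-relative advantages: standardize rewards within the group
    of rollouts sampled for the same prompt."""
    rewards = np.asarray(rewards, dtype=float)
    return (rewards - rewards.mean()) / (rewards.std() + eps)
```

Filtering with `has_gradient_signal` before computing `group_advantages` is what keeps every retained group informative during scaled RL training.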