kangdawei/DAPO-No-DS-7B

Text generation · Concurrency cost: 1 · Model size: 7.6B · Quantization: FP8 · Context length: 32k · Published: Dec 7, 2025 · Architecture: Transformer

The DAPO-No-DS-7B model by kangdawei is a 7.6-billion-parameter language model fine-tuned from deepseek-ai/DeepSeek-R1-Distill-Qwen-7B. It was trained with the DAPO (Decoupled Clip and Dynamic sAmpling Policy Optimization) method on the knoveleng/open-rs dataset and offers a 131072-token context length. The model targets tasks that benefit from reinforcement learning at scale and was built with the TRL framework.


Model Overview

DAPO-No-DS-7B is a 7.6-billion-parameter language model developed by kangdawei. It is a fine-tuned variant of the deepseek-ai/DeepSeek-R1-Distill-Qwen-7B base model, trained with the DAPO (Decoupled Clip and Dynamic sAmpling Policy Optimization) method. This training approach, detailed in the paper "DAPO: An Open-Source LLM Reinforcement Learning System at Scale" (arXiv:2503.14476), uses large-scale reinforcement learning to enhance model performance.

Key Characteristics

  • Base Model: Fine-tuned from deepseek-ai/DeepSeek-R1-Distill-Qwen-7B.
  • Training Method: Utilizes the DAPO method for reinforcement learning at scale.
  • Dataset: Trained on the knoveleng/open-rs dataset.
  • Framework: Developed using the TRL (Transformer Reinforcement Learning) library.
  • Context Length: Supports a substantial context window of 131072 tokens.
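The model can be used like any causal LM on the Hugging Face Hub. Below is a minimal inference sketch with the `transformers` library; the repo id comes from this card, while the prompt format and generation settings are illustrative assumptions rather than values published by the author.

```python
def build_prompt(question: str) -> str:
    """Wrap a question in a simple User/Assistant prompt.

    This plain-text wrapper is an assumption; in practice prefer
    tokenizer.apply_chat_template, which uses the chat template
    shipped with the model repo.
    """
    return f"User: {question}\nAssistant:"


def load_model(model_id: str = "kangdawei/DAPO-No-DS-7B"):
    """Download (or load from cache) the tokenizer and model weights."""
    # Imported lazily so the prompt helper above can be used without
    # transformers installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")
    return tokenizer, model


def answer(question: str, tokenizer, model, max_new_tokens: int = 512) -> str:
    """Generate a completion for a single plain-text question."""
    inputs = tokenizer(build_prompt(question), return_tensors="pt").to(model.device)
    output = model.generate(**inputs, max_new_tokens=max_new_tokens)
    # Decode only the newly generated tokens, not the prompt.
    return tokenizer.decode(
        output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
    )


# Usage (downloads ~7.6B parameters of weights on first call):
#   tokenizer, model = load_model()
#   print(answer("What is 17 * 24?", tokenizer, model))
```

Since the card advertises a long context window, `max_new_tokens` and the input length can be raised well beyond these defaults on hardware with enough memory.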

Intended Use

This model is suitable for applications requiring a large language model refined with advanced reinforcement learning techniques. Its training on the knoveleng/open-rs dataset suggests potential strengths in reasoning-style tasks reflected in that dataset, while the DAPO method aims to improve alignment and response quality through scaled reinforcement learning.
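The DAPO paper's core objective is a token-level clipped surrogate loss with a decoupled clip range ("clip-higher"), where the upper bound can exceed the lower one to encourage exploration. The sketch below is an illustrative pure-Python rendering of that per-token loss, not the author's training code; the default bounds mirror the paper's reported values but are assumptions for this example.

```python
def dapo_surrogate_loss(ratios, advantages, eps_low=0.2, eps_high=0.28):
    """Token-level clipped surrogate loss with DAPO's decoupled clip range.

    ratios:      per-token probability ratios pi_theta / pi_old
    advantages:  per-token advantage estimates
    eps_low, eps_high: asymmetric clip bounds; allowing
        eps_high > eps_low is the "clip-higher" trick.
    """
    total = 0.0
    for r, a in zip(ratios, advantages):
        # Clip the ratio into [1 - eps_low, 1 + eps_high].
        clipped = min(max(r, 1.0 - eps_low), 1.0 + eps_high)
        # Standard PPO-style pessimistic choice between the two surrogates.
        total += min(r * a, clipped * a)
    # Token-level averaging: every token in the batch weighs equally,
    # rather than each sequence contributing one averaged term.
    return -total / len(ratios)


# With ratios of 1.0 the clip is inactive and the loss reduces to the
# negative mean advantage:
# dapo_surrogate_loss([1.0, 1.0], [2.0, -2.0]) -> 0.0
```

DAPO also pairs this loss with dynamic sampling, which filters out prompt groups whose rollouts all receive identical rewards, since those contribute zero gradient.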