kangdawei/MMR-DAPO

Text generation · Concurrency cost: 1 · Model size: 1.5B · Quant: BF16 · Ctx length: 32k · Published: Dec 7, 2025 · Architecture: Transformer · Warm

MMR-DAPO by kangdawei is a 1.5-billion-parameter language model fine-tuned from deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B. It was trained with the DAPO reinforcement learning method on the knoveleng/open-rs dataset and supports a 131,072-token context length. The model is optimized for generating responses grounded in this specialized training, making it suitable for conversational AI and text generation tasks.


Model Overview

MMR-DAPO is a 1.5-billion-parameter language model developed by kangdawei, building on deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B. It distinguishes itself through its training methodology: DAPO (Decoupled Clip and Dynamic sAmpling Policy Optimization), as detailed in the paper "DAPO: An Open-Source LLM Reinforcement Learning System at Scale" (arXiv:2503.14476). The model was fine-tuned on the knoveleng/open-rs dataset, indicating a specialization in the reasoning problems covered by that dataset.
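For context, a sketch of the DAPO objective as described in the cited paper (reconstructed from the paper, not from this model card): DAPO modifies GRPO-style policy optimization with a decoupled clipping range and a token-level loss,

$$
\mathcal{J}_{\text{DAPO}}(\theta) = \mathbb{E}_{q \sim \mathcal{D},\ \{o_i\}_{i=1}^{G} \sim \pi_{\theta_{\text{old}}}(\cdot \mid q)} \left[ \frac{1}{\sum_{i=1}^{G} |o_i|} \sum_{i=1}^{G} \sum_{t=1}^{|o_i|} \min\!\Big( r_{i,t}(\theta)\, \hat{A}_{i,t},\ \operatorname{clip}\big(r_{i,t}(\theta),\ 1-\varepsilon_{\text{low}},\ 1+\varepsilon_{\text{high}}\big)\, \hat{A}_{i,t} \Big) \right]
$$

where $r_{i,t}(\theta)$ is the per-token importance ratio against the old policy and $\hat{A}_{i,t}$ is the group-normalized advantage over the $G$ sampled completions. Setting $\varepsilon_{\text{high}} > \varepsilon_{\text{low}}$ ("clip-higher") leaves more room for raising the probability of low-probability tokens, and dynamic sampling discards prompt groups whose completions are all correct or all wrong, since those yield zero advantage signal.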

Key Capabilities

  • Reinforcement Learning Optimization: Fine-tuned with the DAPO method, which targets more stable and sample-efficient RL training than standard GRPO-style baselines.
  • Extended Context Window: Supports a 131,072-token context length, allowing it to process and generate long, coherent texts.
  • Instruction Following: Generates responses to user prompts in a chat format, as demonstrated by the quick-start example.
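The model card does not reproduce the quick-start snippet here, so below is a minimal inference sketch assuming the standard Hugging Face `transformers` text-generation API; the chat-message format is an assumption based on the DeepSeek-R1-Distill base model.

```python
def build_prompt(question: str) -> list[dict]:
    """Wrap a user question in the chat-message format consumed by
    tokenizer.apply_chat_template (assumed single-turn format)."""
    return [{"role": "user", "content": question}]


def generate_answer(question: str, max_new_tokens: int = 512) -> str:
    # Heavy imports are kept local so build_prompt stays importable
    # without torch/transformers installed.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "kangdawei/MMR-DAPO"
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id, torch_dtype=torch.bfloat16, device_map="auto"
    )

    inputs = tokenizer.apply_chat_template(
        build_prompt(question), add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)

    output = model.generate(inputs, max_new_tokens=max_new_tokens)
    # Decode only the newly generated tokens, not the prompt.
    return tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True)


# generate_answer("What is 17 * 23?")  # downloads ~3.5 GB of weights on first run
```

Generation hyperparameters (temperature, sampling) are left at library defaults here; a real deployment would tune them for the reasoning-style outputs this model was trained to produce.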

Training Details

MMR-DAPO was trained with the TRL (Transformer Reinforcement Learning) framework, specifically version 0.16.0.dev0, alongside Transformers 4.57.1 and PyTorch 2.5.1. The development-version TRL dependency reflects a recent RL fine-tuning pipeline.
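The card names the framework versions but not the training recipe. As a hypothetical configuration fragment, a DAPO-style run in TRL could be set up through `GRPOTrainer`, since recent TRL versions expose DAPO's tricks (such as the decoupled clip range) as `GRPOConfig` options; every hyperparameter below is an assumption, not the author's published setup.

```python
# Hypothetical DAPO-style setup with TRL's GRPOTrainer; all values are
# illustrative assumptions, not the recipe actually used for MMR-DAPO.
from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer

dataset = load_dataset("knoveleng/open-rs", split="train")

config = GRPOConfig(
    output_dir="mmr-dapo",
    num_generations=8,         # group size G for advantage normalization
    epsilon=0.2,               # lower clip bound (1 - eps_low)
    epsilon_high=0.28,         # DAPO "clip-higher": decoupled upper bound
    beta=0.0,                  # DAPO drops the KL penalty
    max_completion_length=8192,
)

def reward_correct(completions, **kwargs):
    # Placeholder verifiable reward; a real run would check each
    # completion's final answer against the dataset's ground truth.
    return [0.0 for _ in completions]

trainer = GRPOTrainer(
    model="deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B",
    reward_funcs=reward_correct,
    args=config,
    train_dataset=dataset,
)
# trainer.train()
```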

Good For

  • Conversational AI: Its instruction-tuned nature and context handling make it suitable for interactive dialogue systems.
  • Text Generation: Effective for generating creative or informative text based on given prompts.
  • Research in RL for LLMs: Provides a practical example of a model trained with the DAPO method, useful for researchers exploring reinforcement-learning fine-tuning of language models.