Zhaoxuan/PUGC-Mistral-DPO

Text Generation · Concurrency Cost: 1 · Model Size: 7B · Quant: FP8 · Ctx Length: 4k · License: apache-2.0 · Architecture: Transformer · Open Weights

Zhaoxuan/PUGC-Mistral-DPO is a 7-billion-parameter language model fine-tuned from Mistral-7B-Instruct-v0.2 using Direct Preference Optimization (DPO). It leverages implicit preferences extracted from user-generated content (UGC) to improve alignment and response quality, transforming UGC into preference data for scalable, domain-specific training. The model shows improved alignment with human preferences, particularly in settings where traditionally curated preference data is costly to obtain.


PUGC-Mistral-DPO: Aligning LLMs with User-Generated Content

This model, PUGC-Mistral-DPO, is a 7 billion parameter language model derived from Mistral-7B-Instruct-v0.2. It was fine-tuned using Direct Preference Optimization (DPO), a method that leverages preference data to align large language models with desired behaviors.
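The DPO objective behind this fine-tune can be illustrated numerically. Below is a minimal sketch of the per-pair loss, assuming the summed token log-probabilities of each response under the policy and a frozen reference model are already available; the function name and inputs are illustrative, not taken from the PUGC codebase:

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Direct Preference Optimization loss for a single preference pair.

    Each argument is the summed log-probability of a response under the
    policy or the frozen reference model; beta scales the implicit reward.
    """
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    margin = chosen_reward - rejected_reward
    # -log(sigmoid(margin)), computed stably as log1p(exp(-margin))
    return math.log1p(math.exp(-margin))

# The loss shrinks as the policy prefers the chosen response more
# strongly than the reference model does:
no_preference = dpo_loss(-10.0, -10.0, -10.0, -10.0)  # margin = 0
learned = dpo_loss(-8.0, -12.0, -10.0, -10.0)         # chosen up, rejected down
```

Minimizing this loss pushes the policy to assign relatively higher probability to preferred responses without drifting far from the reference model, which is what makes the extracted UGC preferences usable as a training signal.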

Key Capabilities & Innovations

  • Implicit Preference Learning: The core innovation is the PUGC framework, which generates preference data by extracting implicit human preferences from unlabeled User-Generated Content (UGC). This addresses the high cost and scalability issues associated with traditional curated preference datasets.
  • Enhanced Alignment: By transforming UGC into user queries and using the original UGC as reference text for response scoring, the model is aligned with these implicit preferences, leading to improved response quality.
  • Performance Improvement: On AlpacaEval 2, PUGC-trained Mistral-7B-Instruct achieves a 35.93% length-controlled win rate, a 9.37% improvement over training on traditional preference data.
  • Scalable & Domain-Specific Alignment: The PUGC approach enables more scalable and domain-specific alignment, as it can utilize readily available UGC.
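The preference-construction loop described in the bullets above can be sketched as follows. This is a hypothetical outline, not code from the PUGC release: `generate_query`, `generate_responses`, and `score_against_reference` are stand-ins for the underlying model calls.

```python
def build_preference_pair(ugc_text, generate_query, generate_responses,
                          score_against_reference):
    """Turn one piece of user-generated content into a DPO training record.

    The UGC is first rewritten into a plausible user query; candidate
    responses to that query are then scored with the original UGC as the
    reference text, and the best and worst candidates become the
    chosen/rejected pair.
    """
    query = generate_query(ugc_text)
    candidates = generate_responses(query)
    ranked = sorted(candidates,
                    key=lambda resp: score_against_reference(resp, ugc_text))
    return {"prompt": query, "chosen": ranked[-1], "rejected": ranked[0]}
```

Because every step is driven by models rather than human annotators, the same loop can be pointed at any in-domain UGC corpus, which is the source of the scalability claim above.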

Good For

  • Researchers and developers interested in novel methods for LLM alignment using less costly data sources.
  • Applications requiring domain-specific alignment where large amounts of curated preference data are unavailable.
  • Exploring the use of user-generated content to improve model performance and human preference alignment.