YuchenLi01/ultrafeedbackSkyworkAgree_alignmentZephyr7BSftFull_sdpo_score_ebs128_lr1e-07_2

Text Generation | Concurrency Cost: 1 | Model Size: 7B | Quant: FP8 | Context Length: 4k | Published: Apr 10, 2025 | Architecture: Transformer

YuchenLi01/ultrafeedbackSkyworkAgree_alignmentZephyr7BSftFull_sdpo_score_ebs128_lr1e-07_2 is a 7 billion parameter language model, fine-tuned from alignment-handbook/zephyr-7b-sft-full. This model was trained using Direct Preference Optimization (DPO) with TRL, enhancing its ability to align with human preferences. It is designed for text generation tasks where preference alignment is crucial, offering improved response quality based on explicit feedback.


Model Overview

This model, YuchenLi01/ultrafeedbackSkyworkAgree_alignmentZephyr7BSftFull_sdpo_score_ebs128_lr1e-07_2, is a 7 billion parameter language model built upon the foundation of alignment-handbook/zephyr-7b-sft-full. It has been specifically fine-tuned using Direct Preference Optimization (DPO), a method that leverages human preference data to improve model alignment without requiring a separate reward model. The training was conducted using the TRL (Transformer Reinforcement Learning) library.
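For quick experimentation, the model can be loaded with the Transformers pipeline API. The snippet below is a minimal sketch, assuming the checkpoint is available on the Hugging Face Hub under this repo id and inherits the Zephyr chat template from its base model; the dtype, device placement, and generation settings are illustrative, not prescribed by the model card.

```python
# Minimal text-generation sketch with Transformers (assumed setup, not an official example).
import torch
from transformers import pipeline

model_id = "YuchenLi01/ultrafeedbackSkyworkAgree_alignmentZephyr7BSftFull_sdpo_score_ebs128_lr1e-07_2"

# Load the checkpoint from the Hugging Face Hub; bfloat16 and device_map
# are illustrative choices for a single-GPU setup.
generator = pipeline(
    "text-generation",
    model=model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Summarize what DPO fine-tuning changes about a model."},
]

# Passing a message list makes the pipeline apply the tokenizer's chat template.
result = generator(messages, max_new_tokens=128, do_sample=False)
print(result[0]["generated_text"][-1]["content"])
```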

Key Capabilities

  • Preference Alignment: Enhanced to generate responses that better align with human preferences, thanks to DPO training.
  • Text Generation: Capable of various text generation tasks, producing outputs informed by preference-based learning.
  • Fine-tuned from Zephyr-7B-SFT-Full: Benefits from the strong base capabilities of its parent model, further refined for alignment.

Training Details

The model's training procedure involved:

  • Methodology: Direct Preference Optimization (DPO), as detailed in the paper "Direct Preference Optimization: Your Language Model is Secretly a Reward Model".
  • Framework: Utilized the TRL library (version 0.12.0) for the training process; see the sketch after this list.
  • Dependencies: Built with Transformers 4.46.3, PyTorch 2.3.0, Datasets 3.1.0, and Tokenizers 0.20.3.
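For context, the following is a minimal sketch of how a DPO run like this can be set up with TRL's DPOTrainer. The dataset, beta, and batch-size split are assumptions (the learning rate and effective batch size are only inferred from the model name), not the author's exact recipe.

```python
# Hedged DPO training sketch with TRL's DPOTrainer; hyperparameters are assumptions.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

base = "alignment-handbook/zephyr-7b-sft-full"
model = AutoModelForCausalLM.from_pretrained(base)
tokenizer = AutoTokenizer.from_pretrained(base)

# Preference dataset with "prompt", "chosen", "rejected" columns.
# UltraFeedback-binarized is used here as a stand-in; the actual training data
# (an UltraFeedback/Skywork preference mix, per the model name) may differ.
train_dataset = load_dataset("HuggingFaceH4/ultrafeedback_binarized", split="train_prefs")

config = DPOConfig(
    output_dir="zephyr-7b-dpo",
    beta=0.1,                        # DPO temperature (assumed default)
    learning_rate=1e-7,              # matches the lr hinted in the model name
    per_device_train_batch_size=8,
    gradient_accumulation_steps=16,  # 8 * 16 = 128 effective batch size on one GPU
    num_train_epochs=1,
)

trainer = DPOTrainer(
    model=model,                     # reference model is created automatically when omitted
    args=config,
    train_dataset=train_dataset,
    processing_class=tokenizer,
)
trainer.train()
```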

Good For

  • Applications requiring high-quality, preference-aligned text generation.
  • Scenarios where a model should incorporate human feedback signals learned implicitly through DPO training.
  • Developers looking for a 7B parameter model with strong conversational or instructional capabilities refined by preference learning.