YuchenLi01/ultrafeedbackSkyworkAgree_alignmentZephyr7BSftFull_sdpo_score_ebs128_lr1e-06_2

Hugging Face
TEXT GENERATIONConcurrency Cost:1Model Size:7BQuant:FP8Ctx Length:4kPublished:Apr 10, 2025Architecture:Transformer Warm

YuchenLi01/ultrafeedbackSkyworkAgree_alignmentZephyr7BSftFull_sdpo_score_ebs128_lr1e-06_2 is a 7 billion parameter language model fine-tuned from alignment-handbook/zephyr-7b-sft-full. This model was trained using Direct Preference Optimization (DPO) via the TRL library, enhancing its ability to align with human preferences. It is designed for generating high-quality, preference-aligned text responses, making it suitable for conversational AI and instruction-following tasks.

Loading preview...

Model Overview

YuchenLi01/ultrafeedbackSkyworkAgree_alignmentZephyr7BSftFull_sdpo_score_ebs128_lr1e-06_2 is a 7 billion parameter language model built upon the alignment-handbook/zephyr-7b-sft-full base model. It has been further fine-tuned using the Direct Preference Optimization (DPO) method, as detailed in the paper "Direct Preference Optimization: Your Language Model is Secretly a Reward Model". This training approach aims to align the model's outputs more closely with human preferences without requiring a separate reward model.

Key Capabilities

  • Preference-aligned text generation: Optimized to produce responses that are preferred by humans, making it suitable for interactive applications.
  • Instruction following: Capable of generating coherent and relevant text based on user prompts.
  • Built on Zephyr-7B-SFT-Full: Leverages the strong foundational capabilities of its base model.

Training Details

The model was trained using the TRL library (version 0.12.0) with DPO. The training process utilized specific versions of key frameworks including Transformers (4.46.3), Pytorch (2.3.0), Datasets (3.1.0), and Tokenizers (0.20.3).

Good For

  • Developing conversational agents that require preference-aligned responses.
  • Applications where human-like quality and alignment are crucial.
  • Research into DPO and preference-based fine-tuning methods.