YuchenLi01/ultrafeedbackSkyworkAgree_alignmentZephyr7BSftFull_sdpo_score_ebs128_lr1e-07_0

Text Generation · Model Size: 7B · Quantization: FP8 · Context Length: 4k · Architecture: Transformer · Published: Apr 11, 2025

The YuchenLi01/ultrafeedbackSkyworkAgree_alignmentZephyr7BSftFull_sdpo_score_ebs128_lr1e-07_0 model is a 7 billion parameter language model fine-tuned from alignment-handbook/zephyr-7b-sft-full. It was trained with Direct Preference Optimization (DPO) using the TRL framework, so its outputs align more closely with human preferences. The model is intended for general text generation tasks, particularly those that benefit from preference-based fine-tuning.


Model Overview

This model, developed by YuchenLi01, is a 7 billion parameter language model derived from the alignment-handbook/zephyr-7b-sft-full base model. It has been specifically fine-tuned using Direct Preference Optimization (DPO), a method detailed in the paper "Direct Preference Optimization: Your Language Model is Secretly a Reward Model." This training approach aims to align the model's outputs more closely with human preferences.
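The card does not include a usage snippet, so here is a minimal inference sketch, assuming the model exposes the standard Hugging Face `transformers` text-generation interface and chat template inherited from its Zephyr base; the prompt and sampling settings are illustrative only.

```python
# Minimal inference sketch (assumption: standard transformers interface,
# chat template inherited from zephyr-7b-sft-full).
import torch
from transformers import pipeline

model_id = "YuchenLi01/ultrafeedbackSkyworkAgree_alignmentZephyr7BSftFull_sdpo_score_ebs128_lr1e-07_0"

generator = pipeline(
    "text-generation",
    model=model_id,
    torch_dtype=torch.bfloat16,  # adjust dtype/device to your hardware
    device_map="auto",
)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Summarize Direct Preference Optimization in two sentences."},
]

# The pipeline applies the tokenizer's chat template before generation.
outputs = generator(messages, max_new_tokens=256, do_sample=True, temperature=0.7)
print(outputs[0]["generated_text"][-1]["content"])  # assistant reply
```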

Key Capabilities

  • Preference-aligned text generation: Enhanced to produce responses that are preferred by humans, thanks to DPO training.
  • General-purpose language understanding and generation: Suitable for a wide range of conversational and text completion tasks.
  • Built on Zephyr-7B-SFT-Full: Inherits the strong foundational capabilities of its base model; see the prompt-format sketch after this list.
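To make that last point concrete, the sketch below shows the prompt layout, assuming the tokenizer inherits the Zephyr chat template from the base model; verify the exact special tokens against the repository's tokenizer config.

```python
# Prompt-format sketch: assuming the Zephyr chat template is inherited,
# apply_chat_template produces the <|system|>/<|user|>/<|assistant|>
# layout the model was tuned on.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "YuchenLi01/ultrafeedbackSkyworkAgree_alignmentZephyr7BSftFull_sdpo_score_ebs128_lr1e-07_0"
)
prompt = tokenizer.apply_chat_template(
    [
        {"role": "system", "content": "You are a friendly chatbot."},
        {"role": "user", "content": "What is preference alignment?"},
    ],
    tokenize=False,
    add_generation_prompt=True,
)
print(prompt)
# Expected shape (Zephyr-style template):
# <|system|>
# You are a friendly chatbot.</s>
# <|user|>
# What is preference alignment?</s>
# <|assistant|>
```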

Training Details

The model was trained using the TRL (Transformer Reinforcement Learning) framework, version 0.12.0. DPO reparameterizes the reward in terms of the policy itself, so pairs of preferred and rejected responses can be used to optimize the language model directly, with no separately trained reward model. This makes it a simple and effective way to improve response quality from human preference feedback.
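For orientation, here is a minimal sketch of DPO training with TRL 0.12.0, not the author's actual script: the dataset is a stand-in for the UltraFeedback/Skywork preference mixture implied by the model name, and the hyperparameters are assumptions loosely read off that name (lr1e-07 suggesting a learning rate of 1e-07, ebs128 an effective batch size of 128).

```python
# DPO training sketch with TRL 0.12.0. Dataset and hyperparameters are
# assumptions, not the author's exact configuration.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

base = "alignment-handbook/zephyr-7b-sft-full"
model = AutoModelForCausalLM.from_pretrained(base)
tokenizer = AutoTokenizer.from_pretrained(base)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

# Preference data must provide "prompt", "chosen", and "rejected" columns;
# this public dataset is a stand-in for the actual training mixture.
train_dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train")

args = DPOConfig(
    output_dir="zephyr-7b-dpo",
    learning_rate=1e-7,             # assumption from "lr1e-07" in the name
    per_device_train_batch_size=4,  # 4 x 32 = 128 on one GPU, matching "ebs128"
    gradient_accumulation_steps=32,
    beta=0.1,                       # DPO KL-penalty strength (TRL default)
)

trainer = DPOTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    processing_class=tokenizer,  # ref model defaults to a frozen copy of `model`
)
trainer.train()
```

Because DPO folds the reward model into the policy's own log-probabilities, the only extra cost over supervised fine-tuning is a forward pass through the frozen reference model.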