HCY123902/llama-3-8b-dpo-tw31-beta-1e-0-ift

TEXT GENERATIONConcurrency Cost:1Model Size:8BQuant:FP8Ctx Length:8kPublished:Apr 21, 2026Architecture:Transformer Cold

HCY123902/llama-3-8b-dpo-tw31-beta-1e-0-ift is an 8 billion parameter Llama 3-based language model, fine-tuned using Direct Preference Optimization (DPO) on a base model from Princeton NLP. This model is designed to generate responses aligned with human preferences, leveraging the DPO method for improved conversational quality. It is suitable for general text generation tasks where preference alignment is beneficial.

Loading preview...

Model Overview

HCY123902/llama-3-8b-dpo-tw31-beta-1e-0-ift is an 8 billion parameter language model built upon the Llama 3 architecture, specifically fine-tuned from princeton-nlp/Llama-3-Base-8B-SFT. This model leverages the Direct Preference Optimization (DPO) method, a technique introduced in the paper "Direct Preference Optimization: Your Language Model is Secretly a Reward Model," to enhance its ability to generate human-preferred responses.

Key Capabilities

  • Preference-aligned text generation: Trained with DPO, the model is optimized to produce outputs that align more closely with human preferences, making it suitable for interactive and conversational applications.
  • Llama 3 foundation: Benefits from the robust base capabilities of the Llama 3 8B model, providing a strong foundation for various natural language processing tasks.
  • Instruction-following: As a fine-tuned model, it is expected to follow instructions effectively, building on its base SFT training.

Training Details

The model was trained using the TRL library (version 0.20.0) with DPO. This training approach aims to directly optimize a policy to maximize the likelihood of preferred responses over dispreferred ones, without the need for an explicit reward model. The training process utilized Transformers 4.54.1 and PyTorch 2.7.1+cu128.

Good For

  • General-purpose text generation where human preference alignment is desired.
  • Applications requiring nuanced and contextually appropriate responses.
  • Exploration of DPO-tuned models based on the Llama 3 architecture.