W-61/llama3-8b-dpo-4xh100-pilot

Text Generation · Concurrency Cost: 1 · Model Size: 8B · Quant: FP8 · Ctx Length: 8k · Published: Mar 28, 2026 · Architecture: Transformer

W-61/llama3-8b-dpo-4xh100-pilot is an 8-billion-parameter language model fine-tuned from princeton-nlp/Llama-3-Base-8B-SFT. It was trained with Direct Preference Optimization (DPO) using the TRL framework to better align its outputs with human preferences, and is intended for general text generation tasks. The model supports a context length of 8192 tokens.


Model Overview

W-61/llama3-8b-dpo-4xh100-pilot is an 8-billion-parameter language model fine-tuned from the princeton-nlp/Llama-3-Base-8B-SFT base model. It was trained with Direct Preference Optimization (DPO), the method introduced in the paper "Direct Preference Optimization: Your Language Model is Secretly a Reward Model", which aims to align a model's outputs more closely with human preferences.
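
For context, DPO fine-tunes the policy directly on preference pairs rather than training a separate reward model. With reference policy π_ref, preferred completion y_w, and dispreferred completion y_l, the objective from the cited paper is:

$$
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}}) = -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}}\left[ \log \sigma\!\left( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)} \right) \right]
$$

Here β controls how far the fine-tuned policy is allowed to drift from the reference (SFT) model; the card does not state the β used for this run.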

Key Capabilities

  • Preference-aligned text generation: DPO training pushes the model toward outputs that humans tend to prefer (see the inference sketch after this list).
  • Llama-3 foundation: built on the Llama-3 architecture, which provides a strong base for a range of NLP tasks.
  • TRL-based fine-tuning: developed with the Transformer Reinforcement Learning (TRL) library's preference-optimization tooling.
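
As a quick illustration, the sketch below loads the checkpoint with the standard transformers text-generation pipeline. It assumes the repository ID above resolves on the Hugging Face Hub and that a GPU with enough memory for an 8B model is available.

```python
# Minimal inference sketch; assumes the model ID resolves on the Hugging Face Hub.
import torch
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="W-61/llama3-8b-dpo-4xh100-pilot",
    torch_dtype=torch.bfloat16,  # bf16 fits an 8B model on a single modern GPU
    device_map="auto",
)

out = generator(
    "Explain Direct Preference Optimization in one paragraph.",
    max_new_tokens=256,
    do_sample=True,
    temperature=0.7,
)
print(out[0]["generated_text"])
```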

Training Details

The model was trained with DPO using TRL 0.19.1, together with Transformers 4.57.6, PyTorch 2.6.0+cu126, Datasets 4.8.4, and Tokenizers 0.22.2. Training curves can be inspected on Weights & Biases, as linked from the original model card.
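
The card does not publish the training script or dataset, but a DPO run with this TRL version typically looks like the following sketch. The dataset, hyperparameters, and output directory are placeholders for illustration, not the values used for this model.

```python
# Hypothetical DPO training sketch with TRL; dataset and hyperparameters
# are placeholders, not the actual configuration behind this checkpoint.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

base = "princeton-nlp/Llama-3-Base-8B-SFT"
model = AutoModelForCausalLM.from_pretrained(base)
tokenizer = AutoTokenizer.from_pretrained(base)

# Preference data needs "prompt"/"chosen"/"rejected" pairs; this public
# dataset is a stand-in, since the card names no training data.
train_dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train")

args = DPOConfig(
    output_dir="llama3-8b-dpo-pilot",
    beta=0.1,                        # KL penalty strength; placeholder value
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,
    learning_rate=5e-7,
    report_to="wandb",               # the card notes W&B logging
)

trainer = DPOTrainer(
    model=model,                     # TRL builds the frozen reference copy itself
    args=args,
    train_dataset=train_dataset,
    processing_class=tokenizer,
)
trainer.train()
```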

Good For

  • Applications requiring text generation with improved human preference alignment.
  • Further experimentation with DPO-trained Llama-3 models.
  • General-purpose conversational AI and content creation where nuanced responses are valued (a chat-style sketch follows below).
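
For conversational use, a chat-formatted call might look like the sketch below. The card does not say whether the tokenizer ships a chat template, so the code falls back to a plain prompt if none is present.

```python
# Hypothetical chat-style usage; falls back to a plain prompt when the
# tokenizer defines no chat template.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "W-61/llama3-8b-dpo-4xh100-pilot"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

messages = [{"role": "user", "content": "Draft a friendly release note for a pilot model."}]
if tokenizer.chat_template is not None:
    prompt = tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
else:
    prompt = messages[0]["content"]

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=200)
# Decode only the newly generated tokens, skipping the echoed prompt.
print(tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```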