W-61/llama-3-8b-base-beta-dpo-ultrafeedback-4xh200-batch-128-20260424-044124

Text Generation · Concurrency Cost: 1 · Model Size: 8B · Quant: FP8 · Ctx Length: 8k · Published: Apr 24, 2026 · Architecture: Transformer

W-61/llama-3-8b-base-beta-dpo-ultrafeedback-4xh200-batch-128-20260424-044124 is an 8 billion parameter Llama 3 base model fine-tuned using Direct Preference Optimization (DPO). This model was trained on the HuggingFaceH4/ultrafeedback_binarized dataset, focusing on aligning its responses with human preferences. It is intended for applications requiring high-quality, preference-aligned text generation.


Overview

This model is a fine-tuned version of W-61/llama-3-8b-base-sft-ultrachat-8xh200 that has undergone Direct Preference Optimization (DPO) on the HuggingFaceH4/ultrafeedback_binarized dataset, layering preference alignment on top of the earlier supervised fine-tuning stage.

Key Characteristics

  • Base Model: Llama 3 8B (fine-tuned from the SFT checkpoint W-61/llama-3-8b-base-sft-ultrachat-8xh200).
  • Fine-tuning Method: Direct Preference Optimization (DPO).
  • Training Data: HuggingFaceH4/ultrafeedback_binarized dataset, indicating a focus on aligning model outputs with human preferences.
  • Context Length: 8192 tokens.

Training Details

The model was trained with a learning rate of 5e-07 and a total batch size of 128 across 4 GPUs, using a cosine learning rate scheduler with a 0.1 warmup ratio for 1 epoch. Evaluation during training reported a final validation loss of 0.6357 and a Beta DPO gap mean of 28.0227, suggesting effective preference alignment.
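For reference, here is a minimal sketch of how a comparable run could be configured with the TRL library. The exact training script is not published; the dataset and the hyperparameters above are carried over, while the beta value, the per-device batch size, and the gradient-accumulation split are illustrative assumptions.

```python
# Hypothetical reproduction sketch using TRL's DPOTrainer.
# Hyperparameters mirror the ones reported above; beta and the
# per-device batch / gradient-accumulation split are assumptions.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

base = "W-61/llama-3-8b-base-sft-ultrachat-8xh200"  # SFT checkpoint named in this card
model = AutoModelForCausalLM.from_pretrained(base)
tokenizer = AutoTokenizer.from_pretrained(base)

# Preference split of the dataset named in this card.
train_dataset = load_dataset("HuggingFaceH4/ultrafeedback_binarized", split="train_prefs")

config = DPOConfig(
    output_dir="llama-3-8b-base-beta-dpo-ultrafeedback",
    learning_rate=5e-7,                # reported learning rate
    lr_scheduler_type="cosine",        # reported scheduler
    warmup_ratio=0.1,                  # reported warmup ratio
    num_train_epochs=1,                # reported number of epochs
    per_device_train_batch_size=8,     # assumption: 8 x 4 GPUs x 4 accum = 128 total
    gradient_accumulation_steps=4,     # assumption (see above)
    beta=0.1,                          # assumption: TRL default
    bf16=True,
)

trainer = DPOTrainer(
    model=model,
    args=config,
    train_dataset=train_dataset,
    processing_class=tokenizer,
)
trainer.train()
```

Note that recent TRL releases accept the tokenizer via `processing_class`; older versions use a `tokenizer` keyword instead.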

Potential Use Cases

Given its DPO fine-tuning on a feedback dataset, this model is likely suitable for:

  • Generating responses that are preferred by humans.
  • Applications requiring high-quality, aligned text outputs.
  • Tasks where nuanced understanding of human preferences is beneficial.
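As an illustration, the model can be loaded for generation with the Hugging Face transformers library. This is a minimal sketch; the repo id comes from this card, and the prompt and sampling settings are illustrative.

```python
# Minimal generation sketch with Hugging Face transformers.
from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "W-61/llama-3-8b-base-beta-dpo-ultrafeedback-4xh200-batch-128-20260424-044124"
tokenizer = AutoTokenizer.from_pretrained(repo)
model = AutoModelForCausalLM.from_pretrained(repo, device_map="auto", torch_dtype="auto")

prompt = "Explain the difference between supervised fine-tuning and DPO in two sentences."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=200, do_sample=True, temperature=0.7)

# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```

Since the SFT stage used UltraChat, the tokenizer may ship with a chat template; if so, formatting the prompt with `tokenizer.apply_chat_template` will generally produce better-aligned outputs than a raw completion prompt.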