jackf857/llama-3-8b-base-robust-dpo-ultrafeedback-8xh200

Text Generation · Model Size: 8B · Quant: FP8 · Context Length: 8k · Concurrency Cost: 1 · Architecture: Transformer · Published: Apr 14, 2026

jackf857/llama-3-8b-base-robust-dpo-ultrafeedback-8xh200 is an 8-billion-parameter Llama 3 model fine-tuned with Direct Preference Optimization (DPO) on the HuggingFaceH4/ultrafeedback_binarized dataset. The DPO stage is designed to improve response quality and alignment by learning from human preference judgments. It builds on a Llama 3 8B base model that had already been supervised fine-tuned for instruction following, making it suitable for general conversational AI and preference-aligned text generation tasks.


Model Overview

This model, jackf857/llama-3-8b-base-robust-dpo-ultrafeedback-8xh200, is an 8 billion parameter language model based on the Llama 3 architecture. It has been fine-tuned using Direct Preference Optimization (DPO) on the HuggingFaceH4/ultrafeedback_binarized dataset, which is designed to align models with human preferences.
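
The checkpoint can be loaded like any Hugging Face model. A minimal sketch follows; the dtype and device-placement settings are illustrative assumptions, not values stated on this card:

```python
# Minimal sketch of loading the checkpoint with Transformers; the dtype
# and device settings below are assumptions, not stated on this card.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "jackf857/llama-3-8b-base-robust-dpo-ultrafeedback-8xh200"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # assumption: bf16, common for Llama 3 8B
    device_map="auto",           # assumption: let Accelerate place the weights
)
```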

Key Characteristics

  • Base Model: Fine-tuned from W-61/llama-3-8b-base-sft-ultrachat-8xh200, a Llama 3 8B base model that had already been supervised fine-tuned (SFT) on UltraChat.
  • Optimization Method: Utilizes Direct Preference Optimization (DPO) to enhance response quality and alignment.
  • Training Data: Fine-tuned on the HuggingFaceH4/ultrafeedback_binarized dataset, which provides preference data for alignment.
  • Performance Metrics: Achieved a validation loss of 0.3504 and a rewards accuracy of 0.7460, meaning the model's implicit reward ranked the preferred response above the rejected one for roughly 74.6% of held-out pairs (see the DPO objective sketched after this list).
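
For reference, the DPO objective (Rafailov et al., 2023) that such a run optimizes can be written as below, where π_ref is the frozen SFT model and β is the KL-regularization strength; the card does not state β, so assume a typical value such as TRL's default of 0.1:

```latex
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}})
  = -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}}
    \left[\log \sigma\!\left(
      \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
      - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
    \right)\right]
```

Rewards accuracy is then simply the fraction of preference pairs for which the β-scaled log-ratio of the chosen response y_w exceeds that of the rejected response y_l.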

Training Details

The model was trained for 1 epoch with a learning rate of 5e-07 and a total batch size of 128 (spread across 8 GPUs), using the AdamW optimizer and a cosine learning rate schedule with a warmup ratio of 0.1.
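
For concreteness, here is a minimal sketch of a comparable run using TRL's DPOTrainer. Hyperparameters marked "from the card" match the stated setup; the per-device batch / gradient-accumulation split, the DPO beta, and the precision flag are assumptions, and depending on your TRL version the tokenizer is passed as processing_class (newer releases) or tokenizer (older ones):

```python
# Sketch of a comparable DPO run with Hugging Face TRL; values not
# marked "from the card" are assumptions.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

base = "W-61/llama-3-8b-base-sft-ultrachat-8xh200"
model = AutoModelForCausalLM.from_pretrained(base)
tokenizer = AutoTokenizer.from_pretrained(base)

train = load_dataset("HuggingFaceH4/ultrafeedback_binarized", split="train_prefs")

args = DPOConfig(
    output_dir="llama-3-8b-robust-dpo",
    learning_rate=5e-7,              # from the card
    num_train_epochs=1,              # from the card
    lr_scheduler_type="cosine",      # from the card
    warmup_ratio=0.1,                # from the card
    optim="adamw_torch",             # AdamW, as stated on the card
    per_device_train_batch_size=8,   # assumption: 8 GPUs x 8 x 2 accum = 128 total
    gradient_accumulation_steps=2,   # assumption (only the 128 total is stated)
    beta=0.1,                        # assumption: TRL's default DPO beta
    bf16=True,                       # assumption
)

trainer = DPOTrainer(
    model=model,                     # ref_model defaults to a frozen copy
    args=args,
    train_dataset=train,
    processing_class=tokenizer,      # `tokenizer=` in older TRL releases
)
trainer.train()
```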

Intended Use Cases

This model is well-suited for applications requiring a robust, preference-aligned language model. Its DPO fine-tuning makes it particularly effective for generating responses that are more helpful, harmless, and aligned with human feedback, making it a strong candidate for conversational agents, content generation, and instruction-following tasks where quality and alignment are critical.
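
As a usage sketch, the snippet below runs a chat-style prompt through a text-generation pipeline. Whether the checkpoint ships a chat template depends on the tokenizer inherited from the SFT stage, so the messages format here is an assumption, and the sampling settings are illustrative:

```python
# Hypothetical generation call; assumes the tokenizer carries a chat
# template from the upstream SFT stage.
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="jackf857/llama-3-8b-base-robust-dpo-ultrafeedback-8xh200",
    device_map="auto",
)

messages = [{"role": "user", "content": "Explain DPO in two sentences."}]
result = generator(messages, max_new_tokens=128, do_sample=True, temperature=0.7)
print(result[0]["generated_text"][-1]["content"])  # the assistant's reply
```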