jackf857/llama-3-8b-base-r-dpo-ultrafeedback-4xH200-batch-128-rerun-2-runpod

Text Generation · Concurrency Cost: 1 · Model Size: 8B · Quant: FP8 · Ctx Length: 8K · Published: Apr 30, 2026 · Architecture: Transformer

The jackf857/llama-3-8b-base-r-dpo-ultrafeedback-4xH200-batch-128-rerun-2-runpod model is an 8 billion parameter Llama 3 base model fine-tuned with Direct Preference Optimization (DPO). It was trained on the HuggingFaceH4/ultrafeedback_binarized dataset to align its responses with human preferences, making it suitable for tasks that require high-quality, preference-aligned text generation on the Llama 3 architecture.


Model Overview

This model, jackf857/llama-3-8b-base-r-dpo-ultrafeedback-4xH200-batch-128-rerun-2-runpod, is an 8 billion parameter language model built on the Llama 3 base model. It has been fine-tuned with Direct Preference Optimization (DPO), a method that aligns model outputs with human preferences by learning from pairs of chosen and rejected responses.
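The model can be loaded with the standard Hugging Face transformers API. The snippet below is a minimal sketch, assuming the checkpoint is published under this repo id on the Hugging Face Hub and that your hardware can hold an 8B model in bfloat16; the prompt is purely illustrative.

```python
# Minimal text-generation sketch with Hugging Face transformers.
# Assumes the checkpoint is available on the Hub under this repo id
# and that a GPU with enough memory for an 8B bfloat16 model is present.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "jackf857/llama-3-8b-base-r-dpo-ultrafeedback-4xH200-batch-128-rerun-2-runpod"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # assumption: bf16 weights fit on your device
    device_map="auto",           # requires the accelerate package
)

# Illustrative prompt; the DPO fine-tuning should bias sampled
# continuations toward preferred-response style.
prompt = "Explain why the sky is blue in two sentences."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128, do_sample=True, temperature=0.7)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```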

Key Characteristics

  • Base Model: Fine-tuned from W-61/llama-3-8b-base-sft-ultrachat-8xh200.
  • Training Data: Utilizes the HuggingFaceH4/ultrafeedback_binarized dataset for DPO training, which consists of pairs of preferred and dispreferred responses.
  • Optimization: Employs DPO to enhance the model's ability to generate responses that are more aligned with human feedback and preferences.
  • Training Configuration: Trained with a learning rate of 5e-07, a total batch size of 128, and a cosine learning rate scheduler over 1 epoch (see the sketch after this list).
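The hyperparameters above map directly onto a DPO training setup. Below is a hypothetical sketch using a recent version of the TRL library's DPOTrainer. Only the learning rate, total batch size of 128, cosine scheduler, and single epoch come from the card; the per-device/accumulation split, the beta value, and the use of TRL itself are assumptions.

```python
# Hypothetical DPO training sketch (recent TRL). Only the learning rate,
# total batch size (128), cosine scheduler, and 1 epoch are stated in the
# card; the batch split and beta below are assumptions.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

base_id = "W-61/llama-3-8b-base-sft-ultrachat-8xh200"  # SFT starting point
model = AutoModelForCausalLM.from_pretrained(base_id)
tokenizer = AutoTokenizer.from_pretrained(base_id)

# The binarized UltraFeedback dataset provides (prompt, chosen, rejected) pairs.
dataset = load_dataset("HuggingFaceH4/ultrafeedback_binarized", split="train_prefs")

config = DPOConfig(
    output_dir="llama-3-8b-dpo-ultrafeedback",
    learning_rate=5e-7,
    lr_scheduler_type="cosine",
    num_train_epochs=1,
    # 8 per device x 4 accumulation steps x 4 GPUs = 128 total (assumed split)
    per_device_train_batch_size=8,
    gradient_accumulation_steps=4,
    beta=0.1,  # assumption: the DPO temperature is not stated in the card
    bf16=True,
)

trainer = DPOTrainer(
    model=model,
    args=config,
    train_dataset=dataset,
    processing_class=tokenizer,
)
trainer.train()
```

When no reference model is passed, DPOTrainer clones the policy model and freezes it as the reference, which matches the standard DPO recipe of regularizing against the SFT starting point.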

Potential Use Cases

This model is particularly well-suited for applications where generating high-quality, preference-aligned text is crucial. Its DPO fine-tuning on a feedback dataset suggests improved performance in:

  • Dialogue systems: Generating more helpful and human-like conversational responses (see the usage sketch after this list).
  • Content generation: Producing text that adheres to specific quality or style preferences.
  • Instruction following: Better understanding and executing user instructions based on learned preferences.
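As a quick illustration of the dialogue use case, the snippet below runs the model through the transformers text-generation pipeline. This is a minimal sketch: whether a chat template ships with this checkpoint's tokenizer is not stated in the card, so a plain string prompt is used instead of structured chat messages.

```python
# Dialogue-style generation via the transformers pipeline.
# A plain string prompt is used because it is unclear whether the
# tokenizer includes a chat template (assumption).
import torch
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="jackf857/llama-3-8b-base-r-dpo-ultrafeedback-4xH200-batch-128-rerun-2-runpod",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

reply = generator(
    "User: My code throws a KeyError when parsing JSON. What should I check?\nAssistant:",
    max_new_tokens=200,
    do_sample=True,
    temperature=0.7,
    return_full_text=False,  # print only the model's continuation
)
print(reply[0]["generated_text"])
```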