W-61/llama-3-8b-base-beta-dpo-ultrafeedback-4xh200-batch-128-20260424-044124
W-61/llama-3-8b-base-beta-dpo-ultrafeedback-4xh200-batch-128-20260424-044124 is an 8-billion-parameter Llama 3 base model fine-tuned with Direct Preference Optimization (DPO) on the HuggingFaceH4/ultrafeedback_binarized dataset to align its responses with human preferences. It is intended for applications requiring high-quality, preference-aligned text generation.
Overview
This model, llama-3-8b-base-beta-dpo-ultrafeedback-4xh200-batch-128-20260424-044124, is an 8-billion-parameter Llama 3 base model that has undergone Direct Preference Optimization (DPO). It was fine-tuned from W-61/llama-3-8b-base-sft-ultrachat-8xh200 on the HuggingFaceH4/ultrafeedback_binarized dataset.
Key Characteristics
- Base Model: Llama 3, 8 billion parameters.
- Fine-tuning Method: Direct Preference Optimization (DPO).
- Training Data: HuggingFaceH4/ultrafeedback_binarized, a binarized preference dataset used to align model outputs with human preferences.
- Context Length: 8192 tokens.
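DPO optimizes the policy directly against pairwise preference data, without a separate reward model. As a rough illustration (not code from this repository; the helper name and the beta value are hypothetical), the per-example loss penalizes the policy when its log-likelihood ratio over the reference model favors the rejected response:

```python
import math

def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """Per-example DPO loss: -log sigmoid(beta * (chosen - rejected log-ratios))."""
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    # Numerically stable -log(sigmoid(margin)), i.e. softplus(-margin).
    if margin > 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

At initialization the policy equals the reference model, so both log-ratios are zero and the loss starts near ln(2) ≈ 0.693; training drives the margin positive.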
Training Details
The model was trained for 1 epoch with a learning rate of 5e-07, an effective batch size of 128 (across 4 GPUs), and a cosine learning rate scheduler with a 0.1 warmup ratio. Reported evaluation metrics include a final validation loss of 0.6357 and a mean beta-DPO gap of 28.0227; a positive gap indicates the policy assigns a higher relative likelihood to chosen responses than to rejected ones.
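The schedule above can be sketched as follows (a minimal approximation, not the actual training code; step counts are illustrative): the learning rate rises linearly for the first 10% of steps, then decays along a cosine curve from the 5e-07 peak.

```python
import math

def lr_at_step(step: int, total_steps: int,
               peak_lr: float = 5e-7, warmup_ratio: float = 0.1) -> float:
    """Linear warmup to peak_lr, then cosine decay toward 0."""
    warmup_steps = max(1, int(total_steps * warmup_ratio))
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))
```

With this shape, the peak learning rate is reached exactly at the end of warmup and the final step lands at (approximately) zero.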
Potential Use Cases
Given its DPO fine-tuning on a human-preference dataset, this model is likely suitable for:
- Generating responses that are preferred by humans.
- Applications requiring high-quality, aligned text outputs.
- Tasks where nuanced understanding of human preferences is beneficial.