jackf857/llama-3-8b-base-r-dpo-ultrafeedback-4xh200
jackf857/llama-3-8b-base-r-dpo-ultrafeedback-4xh200 is an 8-billion-parameter language model fine-tuned from W-61/llama-3-8b-base-sft-ultrachat-8xh200 on the HuggingFaceH4/ultrafeedback_binarized dataset. It uses Direct Preference Optimization (DPO) to align its responses with human preferences and is designed for tasks requiring nuanced, high-quality conversational output, retaining the base Llama 3 architecture with an 8192-token context length.
Overview
This model, jackf857/llama-3-8b-base-r-dpo-ultrafeedback-4xh200, is an 8-billion-parameter language model based on the Llama 3 architecture. It was fine-tuned with Direct Preference Optimization (DPO) on the HuggingFaceH4/ultrafeedback_binarized dataset, starting from the W-61/llama-3-8b-base-sft-ultrachat-8xh200 base model.
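A minimal loading-and-generation sketch with the Hugging Face transformers library is shown below. It assumes the checkpoint follows the standard Llama 3 layout and works with AutoModelForCausalLM; the dtype and sampling settings are illustrative choices, not taken from this card.

```python
# Hedged sketch: load the model and generate a completion.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "jackf857/llama-3-8b-base-r-dpo-ultrafeedback-4xh200"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # assumption: bf16 inference on a recent GPU
    device_map="auto",
)

prompt = "Explain the difference between supervised fine-tuning and DPO."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256, do_sample=True, temperature=0.7)
# Strip the prompt tokens and print only the newly generated text.
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
```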
Key Capabilities
- Preference Alignment: Optimized through DPO to generate responses that align with the human preferences captured in the ultrafeedback_binarized preference pairs.
- Conversational Generation: Inherits and refines capabilities for generating coherent and contextually relevant conversational outputs.
- Base Model Enhancement: Improves upon a supervised fine-tuned (SFT) Llama 3 base model, focusing on refining response quality.
Training Details
The model was trained for 1 epoch with a learning rate of 5e-07 and a total batch size of 128 across 4 GPUs, reaching a final validation loss of 0.5080. The logged DPO length metrics, r_dpo/chosen_len (291.262) and r_dpo/rejected_len (248.396), show that the chosen responses in the preference data are on average longer, and typically more detailed, than the rejected ones.
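For reference, below is a hedged sketch of how a comparable DPO run could be set up with the TRL library. The learning rate, epoch count, and effective batch size come from this card; the per-device batch size and gradient-accumulation decomposition, beta, and precision settings are assumptions, and the exact DPOTrainer argument names vary across trl versions.

```python
# Hedged sketch of the DPO fine-tuning setup using trl.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

base_id = "W-61/llama-3-8b-base-sft-ultrachat-8xh200"
model = AutoModelForCausalLM.from_pretrained(base_id)
tokenizer = AutoTokenizer.from_pretrained(base_id)

# ultrafeedback_binarized exposes preference pairs in the train_prefs split.
dataset = load_dataset("HuggingFaceH4/ultrafeedback_binarized", split="train_prefs")

config = DPOConfig(
    output_dir="llama-3-8b-base-r-dpo-ultrafeedback",
    learning_rate=5e-7,               # from the card
    num_train_epochs=1,               # from the card
    per_device_train_batch_size=8,    # assumption: 8 x 4 GPUs x 4 accum = 128 total
    gradient_accumulation_steps=4,    # assumption (see above)
    beta=0.1,                         # assumption: a common DPO default
    bf16=True,                        # assumption
)

trainer = DPOTrainer(
    model=model,
    args=config,
    train_dataset=dataset,
    processing_class=tokenizer,  # named `tokenizer` in older trl releases
)
trainer.train()
```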
Intended Use Cases
This model is suitable for applications requiring high-quality, preference-aligned text generation, particularly conversational AI, chatbots, and interactive systems where response quality and human-like interaction are crucial. Because DPO training explicitly rewards chosen responses over rejected ones, the model is well suited to settings where the gap between good and bad responses matters.
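Since the SFT base was trained on ultrachat conversations, the tokenizer likely carries a chat template; the sketch below (reusing model and tokenizer from the loading example above) shows chat-style usage under that assumption.

```python
# Sketch of chat-style generation; assumes the tokenizer ships a chat
# template inherited from the ultrachat SFT stage (not confirmed here).
messages = [
    {"role": "user", "content": "Suggest three follow-up questions for a job interview."},
]
input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,  # append the assistant-turn header
    return_tensors="pt",
).to(model.device)

output = model.generate(input_ids, max_new_tokens=256, do_sample=True, temperature=0.7)
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```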