jackf857/llama-3-8b-base-new-dpo-ultrafeedback-4xh200-batch-128-q_t-0.4-s_star-0.5
jackf857/llama-3-8b-base-new-dpo-ultrafeedback-4xh200-batch-128-q_t-0.4-s_star-0.5 is an 8-billion-parameter language model fine-tuned from W-61/llama-3-8b-base-sft-ultrachat-8xh200. It was trained with Direct Preference Optimization (DPO) on the HuggingFaceH4/ultrafeedback_binarized dataset to better align its outputs with human preferences. With an 8192-token context length, it targets tasks that call for nuanced, preference-aligned response generation.
Model Overview
This model, jackf857/llama-3-8b-base-new-dpo-ultrafeedback-4xh200-batch-128-q_t-0.4-s_star-0.5, is an 8-billion-parameter language model derived from W-61/llama-3-8b-base-sft-ultrachat-8xh200. It was fine-tuned with Direct Preference Optimization (DPO) on the HuggingFaceH4/ultrafeedback_binarized dataset, with the aim of generating responses that align more closely with human preferences.
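As a rough illustration of how the model might be loaded and queried, here is a minimal sketch using the Hugging Face `transformers` library. It assumes the repository exposes standard causal-LM weights and a tokenizer, that `torch` and `accelerate` are installed, and that a plain-text prompt is acceptable; the card does not say whether a chat template is expected. The `generate` helper and its defaults are hypothetical names chosen for this example.

```python
MODEL_ID = "jackf857/llama-3-8b-base-new-dpo-ultrafeedback-4xh200-batch-128-q_t-0.4-s_star-0.5"
MAX_CONTEXT_TOKENS = 8192  # context length stated on this page


def generate(prompt: str, max_new_tokens: int = 256) -> str:
    """Hypothetical helper: load the model and return a completion for `prompt`."""
    # Imported lazily so this module can be read without the heavy dependencies.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID,
        torch_dtype=torch.bfloat16,  # assumption: hardware with bf16 support
        device_map="auto",           # requires the `accelerate` package
    )

    # Truncate the prompt so prompt + completion stays within the context window.
    inputs = tokenizer(
        prompt,
        return_tensors="pt",
        truncation=True,
        max_length=MAX_CONTEXT_TOKENS - max_new_tokens,
    ).to(model.device)

    with torch.no_grad():
        output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)

    # Drop the prompt tokens and decode only the newly generated ones.
    new_tokens = output_ids[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)


if __name__ == "__main__":
    print(generate("Explain what preference optimization does in one paragraph."))
```

The helper reloads the model on every call, which keeps the sketch short; in practice the tokenizer and model would be loaded once and reused.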
Training Details
The model was trained with DPO for a single epoch at a learning rate of 5e-07 and an effective batch size of 128. Training used 4 GPUs with 8 gradient accumulation steps, which implies a per-device batch size of 4 (4 × 4 × 8 = 128). The reported training metrics are a final loss of 0.5929 and a mean DPO reward margin of 96.9446. The margin measures how much higher the model scores preferred responses than rejected ones, so a larger value means the two are separated more strongly.
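The batch-size arithmetic and the reward margin can be sketched in a few lines of Python. This is an illustration of the standard DPO objective, not the author's training code: the card does not explain the `q_t` and `s_star` values in the model name, so the objective actually used may differ. The function names, the `beta` value, and the example log-probabilities are all assumptions made for this sketch.

```python
import math


def per_device_batch_size(effective_batch: int, num_gpus: int, grad_accum_steps: int) -> int:
    """Recover the per-device batch size from the effective batch size."""
    per_step = num_gpus * grad_accum_steps
    if effective_batch % per_step != 0:
        raise ValueError("effective batch is not divisible by num_gpus * grad_accum_steps")
    return effective_batch // per_step


def dpo_loss_and_margin(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # assumption: the card does not report beta
) -> tuple[float, float]:
    """Standard DPO loss and reward margin for a single preference pair."""
    # Implicit rewards: scaled log-ratio of the policy to the reference model.
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    margin = chosen_reward - rejected_reward
    # -log(sigmoid(margin)) = log(1 + exp(-margin)), computed in a stable form.
    loss = max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
    return loss, margin


# Values from the training details: 128 effective, 4 GPUs, 8 accumulation steps.
print(per_device_batch_size(128, 4, 8))

# Made-up log-probabilities, purely to show how the margin arises.
loss, margin = dpo_loss_and_margin(-40.0, -55.0, -42.0, -50.0)
print(f"loss={loss:.4f} margin={margin:.4f}")
```

Under this standard formulation, a margin of zero gives a loss of log 2 (about 0.693), and the loss falls toward zero as the margin grows.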
Potential Use Cases
Given its DPO fine-tuning on a preference dataset, this model is likely well-suited for applications where:
- Preference-aligned response generation is critical.
- Instruction following and generating helpful, harmless, and honest outputs are desired.
- Refining outputs based on implicit human feedback is beneficial.
Limitations
The original model card states that more information is needed about the model's intended uses, limitations, and training and evaluation data. Users should evaluate the model on their own tasks before relying on it.