W-61/llama-3-8b-base-margin-dpo-4xh100-real
W-61/llama-3-8b-base-margin-dpo-4xh100-real is an 8-billion-parameter language model fine-tuned from princeton-nlp/Llama-3-Base-8B-SFT using Direct Preference Optimization (DPO) on the HuggingFaceH4/ultrafeedback_binarized dataset, targeting instruction following and preference alignment. It is designed for general language generation tasks where responses refined against preference feedback are beneficial.
Overview
W-61/llama-3-8b-base-margin-dpo-4xh100-real is derived from the princeton-nlp/Llama-3-Base-8B-SFT base model. It has been fine-tuned with Direct Preference Optimization (DPO) on the HuggingFaceH4/ultrafeedback_binarized dataset, which pairs each prompt with a preferred and a rejected response so that the model learns to favor higher-quality, better-aligned outputs.
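For orientation, the preference pairs can be inspected directly with the datasets library. The split and field names below (train_prefs, prompt, chosen, rejected) follow the published layout of HuggingFaceH4/ultrafeedback_binarized, but are worth verifying against the dataset card:

```python
from datasets import load_dataset

# Preference split of the DPO training data. Split and field names
# follow the HuggingFaceH4/ultrafeedback_binarized dataset card;
# verify them against the card before relying on this snippet.
ds = load_dataset("HuggingFaceH4/ultrafeedback_binarized", split="train_prefs")

example = ds[0]
print(example["prompt"])        # the user instruction
print(example["chosen"][-1])    # preferred assistant message (chat format)
print(example["rejected"][-1])  # dispreferred assistant message
```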
Key Characteristics
- Base Model: Llama-3-Base-8B-SFT, providing a strong foundation for language understanding and generation.
- Fine-tuning Method: Direct Preference Optimization (DPO), which trains the model to assign higher likelihood to preferred responses than to rejected ones.
- Training Data: HuggingFaceH4/ultrafeedback_binarized dataset, a common choice for preference alignment tasks.
- Context Length: Supports an 8192 token context window.
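A minimal inference sketch with the transformers library is shown below. The dtype, device placement, and generation settings are illustrative assumptions, not values published with the model:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "W-61/llama-3-8b-base-margin-dpo-4xh100-real"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # assumption: bf16 inference on a recent GPU
    device_map="auto",
)

prompt = "Explain the difference between DPO and RLHF in two sentences."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Illustrative sampling settings; tune for your application.
outputs = model.generate(**inputs, max_new_tokens=256, do_sample=True, temperature=0.7)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
```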
Training Details
The model was trained for 1 epoch with a learning rate of 5e-07 and an effective batch size of 128 (a per-device batch size of 2 across 4 GPUs with 16 gradient accumulation steps, i.e. 2 × 4 × 16 = 128). The optimizer was Adam with standard betas (0.9, 0.999) and epsilon, paired with a cosine learning-rate scheduler and a 0.05 warmup ratio.
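For reference, the reported hyperparameters map onto a trl DPOTrainer configuration roughly as sketched below (API names follow recent trl versions). This is a reconstruction, not the actual training script: the DPO beta, precision flags, and any "margin" variant of the loss implied by the model name are not reported, so the values marked as assumptions are illustrative and the sketch uses standard DPO.

```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

base = "princeton-nlp/Llama-3-Base-8B-SFT"
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

# Recent trl versions accept this conversational chosen/rejected format
# directly; older versions expect plain string columns.
train_dataset = load_dataset("HuggingFaceH4/ultrafeedback_binarized", split="train_prefs")

# Reported settings: lr 5e-07, effective batch 128 (2 per device x 4 GPUs
# x 16 accumulation), 1 epoch, Adam, cosine schedule, 0.05 warmup ratio.
args = DPOConfig(
    output_dir="llama-3-8b-base-margin-dpo",
    learning_rate=5e-7,
    per_device_train_batch_size=2,   # 2 x 4 GPUs x 16 accum = 128 total
    gradient_accumulation_steps=16,
    num_train_epochs=1,
    lr_scheduler_type="cosine",
    warmup_ratio=0.05,
    beta=0.1,                        # assumption: the DPO beta is not reported
    bf16=True,                       # assumption: typical on H100 hardware
)

trainer = DPOTrainer(
    model=model,                     # ref model is created internally when omitted
    args=args,
    train_dataset=train_dataset,
    processing_class=tokenizer,
)
trainer.train()
```

On 4 GPUs, a script like this would typically be launched with accelerate launch --num_processes 4 train_dpo.py, matching the 4xH100 setup in the model name.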
Potential Use Cases
This model is likely suitable for applications requiring high-quality, aligned text generation, such as:
- Instruction-following chatbots.
- Content generation that adheres to specific stylistic or factual preferences.
- Tasks where human-like response quality is prioritized.