li-muyang/zephyr-8b-dpo-full
li-muyang/zephyr-8b-dpo-full is an 8 billion parameter language model fine-tuned from meta-llama/Llama-3.1-8B. This model was trained using Direct Preference Optimization (DPO) on the HuggingFaceH4/ultrafeedback_binarized dataset, focusing on aligning model outputs with human preferences. It is optimized for generating responses that are preferred over rejected alternatives, making it suitable for conversational AI and instruction-following tasks.
Loading preview...
Model Overview
li-muyang/zephyr-8b-dpo-full is an 8 billion parameter language model derived from the meta-llama/Llama-3.1-8B architecture. It has been fine-tuned using Direct Preference Optimization (DPO) on the HuggingFaceH4/ultrafeedback_binarized dataset. This training approach aims to align the model's outputs more closely with human preferences by learning from pairs of chosen and rejected responses.
Training Details
The model was trained with a learning rate of 5e-07, a batch size of 4, and a total effective batch size of 128 across 8 GPUs. The training process involved 1 epoch, utilizing an Adam optimizer and a cosine learning rate scheduler with a 0.1 warmup ratio. Evaluation metrics from the training process indicate a rewards accuracy of 0.7656, suggesting its effectiveness in distinguishing preferred responses.
Key Characteristics
- Base Model: meta-llama/Llama-3.1-8B
- Fine-tuning Method: Direct Preference Optimization (DPO)
- Dataset: HuggingFaceH4/ultrafeedback_binarized
- Parameter Count: 8 billion
Potential Use Cases
This model is particularly well-suited for applications requiring:
- Preference-aligned text generation: Producing outputs that are generally favored by human evaluators.
- Conversational AI: Generating more natural and helpful dialogue responses.
- Instruction following: Adhering to user instructions with improved quality compared to base models.