Model Overview
li-muyang/zephyr-7b-gemma-dpo is a language model built on the Gemma-7B architecture (roughly 8.5 billion total parameters; the "7B" label follows Gemma's convention of counting non-embedding parameters). It has been fine-tuned with Direct Preference Optimization (DPO), which aligns the model with human preferences by training it to favor chosen over rejected responses.
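DPO optimizes a simple logistic loss over preference pairs. A minimal sketch of the per-pair loss in plain Python (the beta value here is an assumption for illustration; the card does not state the one used):

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.05):
    """Per-pair DPO loss.

    Each argument is the summed log-probability of a full response under
    the trainable policy or the frozen reference model; beta (assumed
    here, not taken from the card) controls drift from the reference.
    """
    pi_logratio = policy_chosen_logp - policy_rejected_logp
    ref_logratio = ref_chosen_logp - ref_rejected_logp
    margin = beta * (pi_logratio - ref_logratio)
    # -log(sigmoid(margin)): shrinks as the policy prefers the chosen
    # response more strongly than the reference does.
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

When policy and reference agree exactly, the margin is zero and the loss is log 2; it falls below that as soon as the policy separates chosen from rejected responses more sharply than the reference.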
Training Details
The model was fine-tuned (not trained from scratch; DPO starts from an existing pretrained checkpoint) for 2 epochs with a learning rate of 5e-07 and an effective batch size of 128 across 8 GPUs. Key metrics include a final validation loss of 0.5287, with mean implicit rewards of 0.4753 for chosen responses and -0.1181 for rejected responses, indicating that the model learned to separate preferred from dispreferred outputs. Training used the Adam optimizer and a cosine learning-rate schedule with a 0.1 warmup ratio.
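The schedule combines linear warmup with cosine decay; a small sketch of the rate at a given step, where only the peak rate (5e-07) and warmup ratio (0.1) come from the card and the step counts are illustrative:

```python
import math

def lr_at(step, total_steps, peak_lr=5e-07, warmup_ratio=0.1):
    """Linear warmup to peak_lr over the first warmup_ratio of training,
    then cosine decay to zero, matching the schedule the card describes."""
    warmup_steps = max(1, int(total_steps * warmup_ratio))
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))
```

With 1000 total steps, the rate climbs to 5e-07 by step 100 and decays smoothly back to zero by the final step.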
Key Characteristics
- Architecture: Gemma-7B base (roughly 8.5 billion total parameters).
- Fine-tuning: Direct Preference Optimization (DPO) for alignment.
- Performance Indicators: Reward accuracy of 0.7708, meaning the model assigns the higher implicit reward to the chosen response in about 77% of evaluation pairs.
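Reward accuracy is simply the fraction of evaluation pairs in which the chosen response receives the higher implicit reward; a minimal sketch:

```python
def reward_accuracy(chosen_rewards, rejected_rewards):
    """Fraction of preference pairs where the chosen response's implicit
    reward beats the rejected response's (the card reports 0.7708)."""
    wins = sum(c > r for c, r in zip(chosen_rewards, rejected_rewards))
    return wins / len(chosen_rewards)
```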
Intended Use Cases
The card does not yet define specific intended uses or limitations. In general, DPO-tuned models are well suited to tasks where output quality and alignment with user preferences matter most, such as instruction following, dialogue generation, and content summarization.