W-61/llama-3-8b-base-new-dpo-ultrafeedback-4xh200-batch-128-q_t-0.45-s_star-0.45-20260427-221551
W-61/llama-3-8b-base-new-dpo-ultrafeedback-4xh200-batch-128-q_t-0.45-s_star-0.45-20260427-221551 is an 8 billion parameter language model developed by W-61, fine-tuned from a Llama 3 base model. It was further optimized with Direct Preference Optimization (DPO) on the HuggingFaceH4/ultrafeedback_binarized dataset and is intended for tasks that benefit from preference-based fine-tuning; DPO evaluation metrics are reported in the sections that follow.
Model Overview
This model, developed by W-61, is an 8 billion parameter language model derived from a Llama 3 base architecture. It has undergone a specific fine-tuning process using Direct Preference Optimization (DPO) on the HuggingFaceH4/ultrafeedback_binarized dataset.
Key Characteristics
- Base Model: Fine-tuned from W-61/llama-3-8b-base-sft-ultrachat-8xh200.
- Optimization Method: Direct Preference Optimization (DPO) for alignment.
- Training Data: The HuggingFaceH4/ultrafeedback_binarized dataset.
- Evaluation Metrics: A loss of 0.5654 on the evaluation set, along with DPO-specific metrics including a margin mean of 76.3970.
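The evaluation loss and margin above come from the standard DPO objective, in which implicit rewards are beta-scaled log-probability ratios between the policy and a frozen reference model. A minimal sketch of that computation for a single preference pair (the beta value here is an illustrative assumption, not the one used for this model):

```python
import math

def dpo_loss_and_margin(policy_chosen_logp, policy_rejected_logp,
                        ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss and reward margin for one (chosen, rejected) pair.

    Each implicit reward is beta * (policy log-prob - reference log-prob)
    of the full completion; the loss pushes the margin to be positive.
    """
    reward_chosen = beta * (policy_chosen_logp - ref_chosen_logp)
    reward_rejected = beta * (policy_rejected_logp - ref_rejected_logp)
    margin = reward_chosen - reward_rejected
    # -log sigmoid(margin): shrinks as the chosen completion is preferred more.
    loss = -math.log(1.0 / (1.0 + math.exp(-margin)))
    return loss, margin

# When policy and reference agree exactly, the margin is 0 and the
# loss is log(2) ≈ 0.6931 -- the starting point of DPO training.
loss, margin = dpo_loss_and_margin(-10.0, -12.0, -10.0, -12.0)
```

The reported "margin mean" is this margin averaged over the evaluation set; a large positive value indicates the policy has moved strongly toward the chosen responses relative to the reference.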
Training Details
The model was trained for 1 epoch with a learning rate of 5e-07 and a total batch size of 128, using the ADAMW_TORCH optimizer with a cosine learning rate scheduler.
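The cosine schedule decays the learning rate from its 5e-07 peak to zero over the run. A minimal sketch of that decay in plain Python (warmup behavior and the step count are assumptions: no warmup is applied, and the total-step estimate below is illustrative, not reported for this model):

```python
import math

def cosine_lr(step, total_steps, peak_lr=5e-07, min_lr=0.0):
    """Cosine-decay learning rate: peak_lr at step 0, min_lr at the last step."""
    progress = step / max(total_steps - 1, 1)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

# Illustrative step count only: one epoch at batch size 128 over a
# preference dataset of ~61k pairs would be roughly 61000 // 128 steps.
total_steps = 61000 // 128
lrs = [cosine_lr(s, total_steps) for s in range(total_steps)]
```

The schedule starts exactly at the configured peak, passes through roughly half the peak at the midpoint, and reaches zero on the final step.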
Intended Use Cases
Specific intended uses are not detailed in the provided information. In general, however, models fine-tuned with DPO on preference datasets are well suited to tasks that require close alignment with human preferences, such as instruction following, dialogue generation, and content moderation.