W-61/llama-3-8b-base-new-dpo-ultrafeedback-4xh200-batch-128-q_t-0.45-s_star-0.45-20260427-221551

Text Generation · Concurrency Cost: 1 · Model Size: 8B · Quant: FP8 · Ctx Length: 8k · Published: Apr 28, 2026 · Architecture: Transformer

W-61/llama-3-8b-base-new-dpo-ultrafeedback-4xh200-batch-128-q_t-0.45-s_star-0.45-20260427-221551 is an 8-billion-parameter language model developed by W-61 and fine-tuned from a Llama 3 base model. It was further optimized with Direct Preference Optimization (DPO) on the HuggingFaceH4/ultrafeedback_binarized dataset, making it suited to tasks that benefit from preference-based fine-tuning; its DPO evaluation metrics are summarized below.


Model Overview

This model, developed by W-61, is an 8-billion-parameter language model built on a Llama 3 base architecture and fine-tuned with Direct Preference Optimization (DPO) on the HuggingFaceH4/ultrafeedback_binarized dataset.

Key Characteristics

  • Base Model: Fine-tuned from W-61/llama-3-8b-base-sft-ultrachat-8xh200.
  • Optimization Method: Direct Preference Optimization (DPO) for preference alignment (the objective is sketched after this list).
  • Training Data: The HuggingFaceH4/ultrafeedback_binarized preference dataset.
  • Evaluation Metrics: An evaluation loss of 0.5654, with DPO-specific metrics including a margin mean of 76.3970 on the evaluation set.
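
For context, DPO optimizes the policy directly on preference pairs rather than training a separate reward model. The standard objective from the DPO paper (Rafailov et al., 2023), where π_θ is the policy, π_ref is the frozen SFT reference, and (x, y_w, y_l) is a prompt with its chosen and rejected responses, is:

```latex
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}}) =
  -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}}\left[
    \log \sigma\!\left(
      \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
      - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
    \right)
  \right]
```

The reported margin mean most likely corresponds to the average gap between the implicit rewards (the β-scaled log-ratios) of chosen and rejected responses on the evaluation set, the usual "rewards/margins" metric logged by DPO trainers.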

Training Details

The model was trained for 1 epoch with a learning rate of 5e-07 and a total batch size of 128, using the ADAMW_TORCH optimizer with a cosine learning-rate scheduler.
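
A minimal sketch of how such a run could be set up with Hugging Face TRL is shown below. The per-device batch size and gradient-accumulation split (16 × 2 × 4 GPUs = 128 total) are illustrative assumptions, as is the bf16 setting; only the totals above are reported, and the exact training code used by W-61 is not published.

```python
# Hypothetical reproduction sketch using Hugging Face TRL (the exact API varies across TRL versions).
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

base = "W-61/llama-3-8b-base-sft-ultrachat-8xh200"  # the SFT checkpoint named above
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

# "train_prefs" is the preference-pair split of ultrafeedback_binarized.
dataset = load_dataset("HuggingFaceH4/ultrafeedback_binarized", split="train_prefs")

args = DPOConfig(
    output_dir="llama-3-8b-dpo-ultrafeedback",
    learning_rate=5e-7,              # reported learning rate
    num_train_epochs=1,              # reported epoch count
    lr_scheduler_type="cosine",      # reported scheduler
    optim="adamw_torch",             # reported optimizer
    per_device_train_batch_size=16,  # assumption: 16 x 2 accum x 4 GPUs = 128 total
    gradient_accumulation_steps=2,
    bf16=True,                       # assumption: a common choice on H200 GPUs
)

trainer = DPOTrainer(
    model=model,                 # ref model is derived automatically when not passed
    args=args,
    train_dataset=dataset,
    processing_class=tokenizer,  # older TRL versions take tokenizer= instead
)
trainer.train()
```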

Intended Use Cases

Specific intended uses are not detailed in the provided information. In general, models fine-tuned with DPO on preference datasets are well suited to tasks where alignment with desired outputs is critical, such as instruction following, dialogue generation, and content moderation.
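
As a usage sketch (assuming the repository is available under the model id above and the tokenizer ships a chat template from the UltraChat SFT stage), the model can be queried with the standard transformers generation API:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "W-61/llama-3-8b-base-new-dpo-ultrafeedback-4xh200-batch-128-q_t-0.45-s_star-0.45-20260427-221551"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# Assumes the tokenizer defines a chat template; otherwise pass a plain prompt string.
messages = [{"role": "user", "content": "Explain Direct Preference Optimization in two sentences."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=256, do_sample=True, temperature=0.7)
# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```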