W-61/qwen3-8b-base-new-dpo-ultrafeedback-4xh200-batch-128-q_t-0.4-s_star-0.35-20260430-140517
W-61/qwen3-8b-base-new-dpo-ultrafeedback-4xh200-batch-128-q_t-0.4-s_star-0.35-20260430-140517 is an 8-billion-parameter Qwen3-based language model fine-tuned by W-61. It is a DPO-tuned version of jackf857/qwen3-8b-base-sft-ultrachat-4xh200-batch-128, optimized on the HuggingFaceH4/ultrafeedback_binarized dataset. The model is designed for improved response quality and alignment through Direct Preference Optimization.
Model Overview
This model, developed by W-61, is an 8-billion-parameter Qwen3-based language model. It is a fine-tuned iteration of jackf857/qwen3-8b-base-sft-ultrachat-4xh200-batch-128, enhanced through Direct Preference Optimization (DPO).
Key Capabilities
- Preference Alignment: Optimized using the HuggingFaceH4/ultrafeedback_binarized dataset, indicating a focus on aligning model outputs with human preferences.
- DPO Fine-tuning: Leverages Direct Preference Optimization for improved response quality and reduced undesirable outputs.
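To make the preference-alignment setup concrete, the sketch below shows the shape of a single binarized preference example of the kind DPO trains on. The field names (`prompt`, `chosen`, `rejected`) follow the common HuggingFaceH4 convention; consult the dataset card for the exact schema, and note the message contents here are invented for illustration.

```python
# Illustrative structure of one binarized preference pair: the same prompt
# with a preferred ("chosen") and a dispreferred ("rejected") conversation.
example = {
    "prompt": "Explain gradient descent in one sentence.",
    "chosen": [
        {"role": "user", "content": "Explain gradient descent in one sentence."},
        {"role": "assistant", "content": "Gradient descent iteratively updates "
                                         "parameters in the direction that most "
                                         "reduces the loss."},
    ],
    "rejected": [
        {"role": "user", "content": "Explain gradient descent in one sentence."},
        {"role": "assistant", "content": "It is a thing computers do."},
    ],
}

# DPO consumes such pairs directly: it needs no scalar reward labels, only
# which of the two responses was preferred.
assert set(example) == {"prompt", "chosen", "rejected"}
```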
Training Details
The model was trained for 1 epoch with a learning rate of 5e-07 and a total batch size of 128 across 4 GPUs, using the AdamW optimizer and a cosine learning rate scheduler with a 0.1 warmup ratio. The final evaluation loss was 0.6076, and DPO-specific metrics, including a mean margin of 54.4214 and a mean chosen log-probability of -331.5330, suggest effective preference learning.
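The margin and log-probability metrics above come from the standard DPO objective, which rewards the policy for widening the gap between chosen and rejected responses relative to the SFT reference model. A minimal per-example sketch (the `beta` temperature is an assumption here, not stated on this card; 0.1 is a common default):

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-example DPO loss: -log(sigmoid(beta * margin)).

    The "rewards" are the policy-vs-reference log-prob gaps, and the
    margin (chosen reward minus rejected reward) is the quantity
    reported as the mean margin in the training metrics above.
    beta is assumed; it is not specified on this model card.
    """
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    margin = chosen_reward - rejected_reward
    loss = -math.log(1.0 / (1.0 + math.exp(-margin)))
    return loss, margin

# Example with made-up sequence log-probs: the policy has moved toward
# the chosen response and away from the rejected one, so margin > 0
# and the loss is small.
loss, margin = dpo_loss(-300.0, -400.0, -320.0, -390.0)
```

A large positive mean margin, as reported for this model, indicates the trained policy assigns substantially higher relative likelihood to preferred responses than the reference model does.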
Good For
- Applications requiring models with improved alignment to human feedback.
- Tasks where response quality and preference adherence are critical.