W-61/qwen3-8b-base-new-dpo-ultrafeedback-4xh200-batch-128-q_t-0.45-s_star-0.45-20260430-143919
W-61/qwen3-8b-base-new-dpo-ultrafeedback-4xh200-batch-128-q_t-0.45-s_star-0.45-20260430-143919 is an 8-billion-parameter language model, fine-tuned from jackf857/qwen3-8b-base-sft-ultrachat-4xh200-batch-128 using Direct Preference Optimization (DPO) on the HuggingFaceH4/ultrafeedback_binarized dataset. The model targets improved alignment and preference following, reporting DPO evaluation metrics such as a margin mean of 44.1246 and chosen log-probabilities (logps) of -308.8182. It is suitable for applications requiring a robust 8B-parameter model with enhanced conversational quality and adherence to user preferences.
Model Overview
This model, W-61/qwen3-8b-base-new-dpo-ultrafeedback-4xh200-batch-128-q_t-0.45-s_star-0.45-20260430-143919, is an 8 billion parameter language model. It is a fine-tuned variant of jackf857/qwen3-8b-base-sft-ultrachat-4xh200-batch-128, specifically optimized using Direct Preference Optimization (DPO).
Key Capabilities
- Preference Alignment: Fine-tuned on the HuggingFaceH4/ultrafeedback_binarized dataset, indicating an emphasis on aligning model outputs with human preferences.
- DPO Optimization: Trained with DPO, which improves the quality and safety of generative models by optimizing the policy directly on preference pairs, without training a separate reward model.
- Performance Metrics: Achieved a validation loss of 0.5615, with notable DPO metrics including a margin mean of 44.1246 and chosen log-probabilities (logps) of -308.8182, suggesting effective preference learning. A sketch of the underlying loss follows this list.
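For context on how the margin metric arises, DPO defines an implicit reward as the beta-scaled log-ratio of the policy to a frozen reference model, and minimizes the negative log-sigmoid of the chosen-minus-rejected reward difference. Below is a minimal PyTorch sketch of that loss; the function name and the beta value are illustrative, not taken from this model's training code:

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Direct Preference Optimization loss (Rafailov et al., 2023).

    Each argument is a 1-D tensor of summed per-token log-probabilities
    for a batch of (chosen, rejected) response pairs. beta is illustrative.
    """
    # Implicit rewards: beta-scaled log-ratio of policy to reference model.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # The "margin" metric reported above is the mean of this difference.
    margins = chosen_rewards - rejected_rewards
    # Maximizing the log-sigmoid of the margin == minimizing this loss.
    loss = -F.logsigmoid(margins).mean()
    return loss, margins.mean()
```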
Training Details
The model was trained for 1 epoch with a learning rate of 5e-07, using the AdamW optimizer and a cosine learning rate scheduler. The effective training batch size was 128, obtained from 4 GPUs with 8 gradient accumulation steps (implying a per-device batch size of 4). A sketch of a comparable training configuration is shown below.
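The following is a minimal sketch of how a run with these hyperparameters could be set up using Hugging Face TRL. It is an assumption that TRL was the training framework; the per-device batch size of 4 is derived from 128 / (4 GPUs x 8 accumulation steps), and the exact TRL API may differ across versions:

```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

base = "jackf857/qwen3-8b-base-sft-ultrachat-4xh200-batch-128"
model = AutoModelForCausalLM.from_pretrained(base)
tokenizer = AutoTokenizer.from_pretrained(base)
dataset = load_dataset("HuggingFaceH4/ultrafeedback_binarized", split="train_prefs")

config = DPOConfig(
    output_dir="qwen3-8b-dpo-ultrafeedback",   # illustrative path
    num_train_epochs=1,
    learning_rate=5e-7,
    per_device_train_batch_size=4,             # 128 / (4 GPUs * 8 accum)
    gradient_accumulation_steps=8,
    lr_scheduler_type="cosine",
    optim="adamw_torch",
    bf16=True,                                  # assumed, not stated in the card
)

trainer = DPOTrainer(
    model=model,
    args=config,
    train_dataset=dataset,
    processing_class=tokenizer,
)
trainer.train()
```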
Intended Use Cases
This model is suitable for applications where conversational quality, adherence to user instructions, and preference alignment are critical. Its DPO fine-tuning makes it a strong candidate for chatbots, interactive AI assistants, and content generation tasks that benefit from human feedback integration.
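As a usage illustration, the model can be loaded for chat-style inference with transformers. This sketch assumes the checkpoint ships a chat template from its SFT stage; the prompt and sampling settings are illustrative:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "W-61/qwen3-8b-base-new-dpo-ultrafeedback-4xh200-batch-128-q_t-0.45-s_star-0.45-20260430-143919"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", torch_dtype="auto")

messages = [{"role": "user", "content": "Explain DPO in two sentences."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=256, do_sample=True, temperature=0.7)
# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```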