W-61/qwen3-8b-base-epsilon-dpo-ultrafeedback-4xh200-batch-128-20260422-131855
W-61/qwen3-8b-base-epsilon-dpo-ultrafeedback-4xh200-batch-128-20260422-131855 is an 8-billion-parameter language model fine-tuned by W-61 with Direct Preference Optimization (DPO) on the HuggingFaceH4/ultrafeedback_binarized dataset. It is a direct fine-tune of W-61/qwen3-8b-base-sft-ultrachat-4xh200-batch-128 and reaches a rewards/accuracies score of 0.7165, indicating improved preference alignment. The model is suited to conversational AI and instruction-following tasks where outputs should track human preferences.
Model Overview
This model, qwen3-8b-base-epsilon-dpo-ultrafeedback-4xh200-batch-128-20260422-131855, is an 8 billion parameter language model developed by W-61. It is a fine-tuned variant of the W-61/qwen3-8b-base-sft-ultrachat-4xh200-batch-128 base model, specifically optimized using Direct Preference Optimization (DPO).
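The model can be loaded with the Hugging Face transformers library. The snippet below is a minimal usage sketch, assuming a standard causal-LM checkpoint with a chat template; adjust dtype and device settings to your hardware.

```python
# Minimal usage sketch; assumes a standard transformers causal-LM checkpoint.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "W-61/qwen3-8b-base-epsilon-dpo-ultrafeedback-4xh200-batch-128-20260422-131855"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

# Build a chat-formatted prompt (assumes the tokenizer ships a chat template).
messages = [{"role": "user", "content": "Explain what DPO fine-tuning does."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```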
Key Capabilities
- Preference Alignment: Fine-tuned on the HuggingFaceH4/ultrafeedback_binarized dataset, indicating a focus on aligning model outputs with human preferences.
- Improved Reward Metrics: Achieved a rewards/accuracies score of 0.7165, with rewards/chosen at -0.1262 and rewards/rejected at -0.2486, suggesting better discrimination between preferred and rejected responses (see the sketch after this list).
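For context, these metrics derive from DPO's implicit reward: beta times the log-ratio of policy to reference probabilities for a response. rewards/accuracies is the fraction of pairs where the chosen response's implicit reward exceeds the rejected one's. The sketch below illustrates the computation; the log-probability values and beta are hypothetical, not taken from this run.

```python
# Illustrative sketch of how DPO-style reward metrics are computed.
# The log-probability values below are hypothetical, not from this training run.
import torch

beta = 0.1  # assumed DPO temperature; the actual value is not stated in this card

# Per-example log-prob sums under the policy and the frozen reference model.
policy_chosen_logps = torch.tensor([-52.1, -48.3, -60.7])
ref_chosen_logps = torch.tensor([-51.0, -47.9, -59.2])
policy_rejected_logps = torch.tensor([-55.4, -50.2, -58.9])
ref_rejected_logps = torch.tensor([-53.0, -49.5, -58.8])

# Implicit rewards: beta * log(pi_theta / pi_ref) for each response.
chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)

# rewards/accuracies: fraction of pairs where chosen outranks rejected.
accuracy = (chosen_rewards > rejected_rewards).float().mean()
print(chosen_rewards.mean().item(), rejected_rewards.mean().item(), accuracy.item())
```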
Training Details
The model was trained with a learning rate of 5e-07 and a total batch size of 128 (4 GPUs × a per-device batch of 4 × 8 gradient accumulation steps) for 1 epoch. The optimizer was ADAMW_TORCH with a cosine learning-rate scheduler and a warmup ratio of 0.1. Training finished with a validation loss of 0.6398.
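These hyperparameters map onto a trl DPOConfig roughly as sketched below. This is a reconstruction from the reported numbers, not the actual training script; the beta value, dataset split, and sequence handling are assumptions, and the exact DPOTrainer signature varies across trl versions.

```python
# Hedged reconstruction of the training setup from the reported hyperparameters.
# NOT the actual training script; beta and the dataset split are assumptions.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

base_id = "W-61/qwen3-8b-base-sft-ultrachat-4xh200-batch-128"
tokenizer = AutoTokenizer.from_pretrained(base_id)
model = AutoModelForCausalLM.from_pretrained(base_id)

dataset = load_dataset("HuggingFaceH4/ultrafeedback_binarized", split="train_prefs")

config = DPOConfig(
    output_dir="qwen3-8b-dpo-ultrafeedback",
    learning_rate=5e-7,
    per_device_train_batch_size=4,  # 4 GPUs x 4 per device x 8 grad-accum = 128 total
    gradient_accumulation_steps=8,
    num_train_epochs=1,
    optim="adamw_torch",
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    beta=0.1,  # assumed; the DPO beta is not reported in this card
)

trainer = DPOTrainer(
    model=model,
    args=config,
    train_dataset=dataset,
    processing_class=tokenizer,
)
trainer.train()
```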