jackf857/qwen-3-8b-base-r-dpo-ultrafeedback-4xH200-batch-128-rerun-2-runpod
This model is an 8-billion-parameter language model based on the Qwen3 architecture, fine-tuned by jackf857 with Direct Preference Optimization (DPO) on the HuggingFaceH4/ultrafeedback_binarized dataset. It refines jackf857/qwen3-8b-base-sft-ultrachat-4xh200-batch-128, using preference learning to improve response quality and alignment. The model is suited to general text generation and conversational AI tasks where producing the preferred of several candidate responses matters.
Model Overview
This model, developed by jackf857, is an 8-billion-parameter variant of the Qwen3 architecture. It was fine-tuned with Direct Preference Optimization (DPO) on the HuggingFaceH4/ultrafeedback_binarized dataset, starting from the supervised fine-tuned checkpoint jackf857/qwen3-8b-base-sft-ultrachat-4xh200-batch-128.
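As a minimal sketch, the model can be loaded with the standard Transformers API; the repository id below is taken from the model name above, and the generation settings are illustrative rather than tuned:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "jackf857/qwen-3-8b-base-r-dpo-ultrafeedback-4xH200-batch-128-rerun-2-runpod"

# Load the tokenizer and model; bfloat16 keeps the 8B weights within a single GPU's memory.
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

prompt = "Explain the difference between supervised fine-tuning and DPO in two sentences."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Illustrative sampling settings; adjust for your use case.
outputs = model.generate(**inputs, max_new_tokens=128, do_sample=True, temperature=0.7)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```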
Key Characteristics
- Base Model: Qwen3-8B-Base architecture.
- Fine-tuning Method: Direct Preference Optimization (DPO) for improved response quality and alignment (the objective is written out after this list).
- Training Data: Utilizes the HuggingFaceH4/ultrafeedback_binarized dataset for preference learning.
- Performance Metrics: Final loss of 0.5469 on the evaluation set; the logged DPO metrics indicate improved preference alignment over the SFT checkpoint.
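For reference, the standard DPO objective (Rafailov et al., 2023) that this fine-tuning stage optimizes is:

$$
\mathcal{L}_{\text{DPO}}(\pi_\theta; \pi_{\text{ref}}) = -\,\mathbb{E}_{(x,\, y_w,\, y_l)\sim\mathcal{D}}\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}\right)\right]
$$

where $y_w$ and $y_l$ are the chosen and rejected responses in a preference pair, $\pi_{\text{ref}}$ is the frozen SFT model, and $\beta$ controls how far the policy may drift from the reference. The card does not report the $\beta$ value used.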
Training Details
The model was trained for 1 epoch with a learning rate of 5e-07 and a total batch size of 128 (4 devices × a per-device batch of 4 × 8 gradient accumulation steps), using a cosine learning rate scheduler with a warmup ratio of 0.1. Training used Transformers 4.51.0, PyTorch 2.3.1+cu121, Datasets 2.21.0, and Tokenizers 0.21.4.
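The card does not name the training framework, but the stated hyperparameters map naturally onto TRL's DPOTrainer; the following is a minimal sketch under that assumption (the DPO beta below is illustrative, since the card does not report it):

```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

# The SFT checkpoint named on the card as the starting point.
base_id = "jackf857/qwen3-8b-base-sft-ultrachat-4xh200-batch-128"
model = AutoModelForCausalLM.from_pretrained(base_id)
tokenizer = AutoTokenizer.from_pretrained(base_id)

# Binarized UltraFeedback preference pairs (prompt / chosen / rejected).
train_dataset = load_dataset("HuggingFaceH4/ultrafeedback_binarized", split="train_prefs")

# Hyperparameters as reported on the card: lr 5e-7, effective batch 128
# (4 devices x 4 per-device x 8 grad-accum steps), cosine schedule, 0.1 warmup, 1 epoch.
config = DPOConfig(
    output_dir="qwen3-8b-dpo-ultrafeedback",
    learning_rate=5e-7,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=8,
    num_train_epochs=1,
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    beta=0.1,  # illustrative; the card does not report the DPO beta
    bf16=True,
)

trainer = DPOTrainer(
    model=model,          # the reference model is cloned internally when not passed
    args=config,
    train_dataset=train_dataset,
    processing_class=tokenizer,
)
trainer.train()
```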
Intended Use Cases
This model is suited to applications that require high-quality, aligned text generation, and to conversational AI where matching user preferences shapes the desired output. Its DPO fine-tuning should make it more likely to produce the response a human rater would prefer over alternatives.
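For conversational use, and assuming the tokenizer inherited a chat template from the SFT stage (the card does not confirm this), generation would look roughly like:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "jackf857/qwen-3-8b-base-r-dpo-ultrafeedback-4xH200-batch-128-rerun-2-runpod"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="auto")

messages = [{"role": "user", "content": "What are three ways to reduce LLM hallucinations?"}]

# apply_chat_template formats the turn markers the model saw during SFT/DPO;
# if no chat template is present, fall back to plain prompting as shown earlier.
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
outputs = model.generate(input_ids, max_new_tokens=256)
print(tokenizer.decode(outputs[0][input_ids.shape[1]:], skip_special_tokens=True))
```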