W-61/qwen3-8b-base-new-dpo-ultrafeedback-4xh200-batch-128-q_t-0.43-s_star-0.4-20260429-230725
W-61/qwen3-8b-base-new-dpo-ultrafeedback-4xh200-batch-128-q_t-0.43-s_star-0.4-20260429-230725 is an 8-billion-parameter language model fine-tuned from jackf857/qwen3-8b-base-sft-ultrachat-4xh200-batch-128 on the HuggingFaceH4/ultrafeedback_binarized dataset. It was optimized with Direct Preference Optimization (DPO) to align with human preferences and shows improved response quality over its SFT base. It suits applications requiring nuanced, preference-aligned text generation within a 32768-token context window.
Model Overview
This model, W-61/qwen3-8b-base-new-dpo-ultrafeedback-4xh200-batch-128-q_t-0.43-s_star-0.4-20260429-230725, is an 8-billion-parameter language model. It is a fine-tuned variant of jackf857/qwen3-8b-base-sft-ultrachat-4xh200-batch-128, optimized with the Direct Preference Optimization (DPO) method.
Training Details
The model was trained on the HuggingFaceH4/ultrafeedback_binarized dataset. Key hyperparameters included a learning rate of 5e-07, a per-device train batch size of 4, and 8 gradient accumulation steps; across the 4 H200 GPUs indicated by the model name, this yields an effective total train batch size of 128 (4 GPUs × 4 × 8). Training ran for 1 epoch with a cosine learning rate scheduler and a 0.1 warmup ratio. Evaluation shows a validation loss of 0.5897 and a mean DPO margin of 51.0513, indicating that the model learned to clearly separate chosen from rejected responses.
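For reference, the sketch below shows how these reported hyperparameters could map onto a DPO run using the TRL library (a recent version with DPOConfig/DPOTrainer). This is a hypothetical reconstruction, not the actual training script: the DPO beta, dataset preprocessing, bf16 setting, and distributed-launch details are assumptions.

```python
# Minimal DPO training sketch (assumed: TRL's DPOTrainer; the actual
# training script, DPO beta, and preprocessing are not published here).
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

base = "jackf857/qwen3-8b-base-sft-ultrachat-4xh200-batch-128"
model = AutoModelForCausalLM.from_pretrained(base)
tokenizer = AutoTokenizer.from_pretrained(base)

# Preference pairs: each row carries a prompt plus chosen/rejected responses.
dataset = load_dataset("HuggingFaceH4/ultrafeedback_binarized",
                       split="train_prefs")

config = DPOConfig(
    output_dir="qwen3-8b-dpo-ultrafeedback",
    learning_rate=5e-7,             # reported learning rate
    per_device_train_batch_size=4,  # reported per-device batch size
    gradient_accumulation_steps=8,  # 4 GPUs x 4 x 8 = 128 effective
    num_train_epochs=1,
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    bf16=True,                      # assumption: bf16 training on H200s
)

trainer = DPOTrainer(
    model=model,
    args=config,
    train_dataset=dataset,
    processing_class=tokenizer,
)
trainer.train()
```

Launched with a multi-GPU runner such as accelerate, the per-device batch size and accumulation steps above reproduce the effective batch size of 128 described in the training details.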
Key Characteristics
- Parameter Count: 8 billion parameters.
- Context Length: Supports a context window of 32768 tokens.
- Optimization Method: Fine-tuned using Direct Preference Optimization (DPO) for enhanced alignment with human feedback.
Potential Use Cases
This model is well-suited for applications where generating responses that align with human preferences is crucial. Its DPO fine-tuning suggests improved conversational quality and adherence to desired output styles compared to models trained solely with Supervised Fine-Tuning (SFT).
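A minimal inference sketch with the Hugging Face transformers library is shown below. It assumes the repository ships a Qwen3-style tokenizer with a chat template inherited from the base model; the prompt and sampling parameters are illustrative choices, not recommendations from the model authors.

```python
# Minimal inference sketch (assumed: the repo provides a Qwen3-style
# chat template; sampling settings are illustrative).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo = ("W-61/qwen3-8b-base-new-dpo-ultrafeedback-4xh200-batch-128"
        "-q_t-0.43-s_star-0.4-20260429-230725")
tokenizer = AutoTokenizer.from_pretrained(repo)
model = AutoModelForCausalLM.from_pretrained(
    repo, torch_dtype=torch.bfloat16, device_map="auto"
)

messages = [{"role": "user",
             "content": "Explain DPO in two sentences."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output = model.generate(inputs, max_new_tokens=256, do_sample=True,
                        temperature=0.7, top_p=0.9)
# Decode only the newly generated tokens, not the prompt.
print(tokenizer.decode(output[0, inputs.shape[-1]:],
                       skip_special_tokens=True))
```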