W-61/qwen3-8b-base-new-dpo-ultrafeedback-4xh200-batch-128-q_t-0.43-s_star-0.4-20260429-230725

Text Generation · Concurrency Cost: 1 · Model Size: 8B · Quant: FP8 · Ctx Length: 32k · Published: Apr 30, 2026 · Architecture: Transformer

W-61/qwen3-8b-base-new-dpo-ultrafeedback-4xh200-batch-128-q_t-0.43-s_star-0.4-20260429-230725 is an 8 billion parameter language model, fine-tuned from jackf857/qwen3-8b-base-sft-ultrachat-4xh200-batch-128 using the HuggingFaceH4/ultrafeedback_binarized dataset. This model is optimized through Direct Preference Optimization (DPO) to align with human preferences, demonstrating improved response quality over its base SFT version. It is suitable for applications requiring nuanced and preference-aligned text generation within a 32768 token context window.


Model Overview

This model, W-61/qwen3-8b-base-new-dpo-ultrafeedback-4xh200-batch-128-q_t-0.43-s_star-0.4-20260429-230725, is an 8 billion parameter language model. It is a fine-tuned variant of jackf857/qwen3-8b-base-sft-ultrachat-4xh200-batch-128, specifically optimized using the Direct Preference Optimization (DPO) method.
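A minimal inference sketch is shown below. The model id comes from this card; everything else (the ChatML-style prompt format, generation parameters, and the `generate_reply` helper name) is an assumption for illustration, and in practice `tokenizer.apply_chat_template` should be preferred over hand-built prompts.

```python
MODEL_ID = (
    "W-61/qwen3-8b-base-new-dpo-ultrafeedback-4xh200-batch-128-"
    "q_t-0.43-s_star-0.4-20260429-230725"
)


def build_prompt(messages):
    """Format chat messages in a ChatML-like layout (an assumption; the
    tokenizer's own chat template is authoritative)."""
    parts = [
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>" for m in messages
    ]
    parts.append("<|im_start|>assistant\n")
    return "\n".join(parts)


def generate_reply(messages, max_new_tokens=128):
    """Load the model and generate a reply.

    Requires the `transformers` package and network access to download the
    weights; shown here as a sketch only, not executed at import time.
    """
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID, torch_dtype="auto", device_map="auto"
    )
    inputs = tokenizer(build_prompt(messages), return_tensors="pt").to(model.device)
    output = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(output[0], skip_special_tokens=True)
```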

Training Details

The model was trained on the HuggingFaceH4/ultrafeedback_binarized dataset. Key training hyperparameters included a learning rate of 5e-07, a per-device train_batch_size of 4, and gradient_accumulation_steps of 8; across the 4 GPUs implied by the 4xH200 setup, this yields a total_train_batch_size of 128. Training used a cosine learning rate scheduler with a 0.1 warmup ratio over 1 epoch. Evaluation metrics show a validation loss of 0.5897 and a mean DPO margin of 51.0513, indicating effective preference alignment.
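The DPO objective behind these numbers can be sketched in plain Python. The margin is the difference between the implicit rewards of the chosen and rejected responses (each reward being beta times the policy-vs-reference log-probability gap), and the loss is the negative log-sigmoid of that margin. The beta value of 0.1 below is an assumption; the card does not state it. The batch-size arithmetic restates the hyperparameters above.

```python
import math


def dpo_loss(policy_chosen_lp, policy_rejected_lp,
             ref_chosen_lp, ref_rejected_lp, beta=0.1):
    """Per-example DPO loss and reward margin.

    Inputs are summed log-probabilities of each response under the policy
    and the frozen reference (SFT) model. beta=0.1 is an assumed default.
    """
    chosen_reward = beta * (policy_chosen_lp - ref_chosen_lp)
    rejected_reward = beta * (policy_rejected_lp - ref_rejected_lp)
    margin = chosen_reward - rejected_reward
    # -log(sigmoid(margin)): small when the policy clearly prefers the
    # chosen response relative to the reference model.
    loss = -math.log(1.0 / (1.0 + math.exp(-margin)))
    return loss, margin


# Effective batch size from the card's hyperparameters:
# 4 per-device x 8 accumulation steps x 4 GPUs (4xH200) = 128.
per_device, grad_accum, num_gpus = 4, 8, 4
total_train_batch_size = per_device * grad_accum * num_gpus
```

A large positive margin mean (51.05 at evaluation here) indicates the policy assigns much higher relative likelihood to preferred responses than the reference model does.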

Key Characteristics

  • Parameter Count: 8 billion parameters.
  • Context Length: Supports a context window of 32768 tokens.
  • Optimization Method: Fine-tuned using Direct Preference Optimization (DPO) for enhanced alignment with human feedback.
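The 32768-token context window above bounds prompt length plus generation budget together. A small helper for checking that budget might look like this (the function names are illustrative; token counts must come from the model's actual tokenizer):

```python
CONTEXT_WINDOW = 32768  # from the model card


def fits_in_context(prompt_tokens, max_new_tokens,
                    context_window=CONTEXT_WINDOW):
    """True if the prompt plus the planned generation fits the window."""
    return prompt_tokens + max_new_tokens <= context_window


def max_generable(prompt_tokens, context_window=CONTEXT_WINDOW):
    """Largest max_new_tokens value still allowed for this prompt."""
    return max(0, context_window - prompt_tokens)
```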

Potential Use Cases

This model is well-suited for applications where generating responses that align with human preferences is crucial. Its DPO fine-tuning suggests improved conversational quality and adherence to desired output styles compared to models trained solely with Supervised Fine-Tuning (SFT).