W-61/qwen3-8b-base-new-dpo-ultrafeedback-4xh200-batch-128-q_t-0.4-s_star-0.35-20260430-140517

Text generation · Model size: 8B · Quantization: FP8 · Context length: 32k · Concurrency cost: 1 · Architecture: Transformer · Published: Apr 30, 2026

W-61/qwen3-8b-base-new-dpo-ultrafeedback-4xh200-batch-128-q_t-0.4-s_star-0.35-20260430-140517 is an 8 billion parameter Qwen3-based language model fine-tuned by W-61. It is a DPO-tuned version of jackf857/qwen3-8b-base-sft-ultrachat-4xh200-batch-128, trained with Direct Preference Optimization on the HuggingFaceH4/ultrafeedback_binarized dataset to improve response quality and alignment with human preferences.
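
A minimal usage sketch with Hugging Face transformers is below. The repository ID is assumed to match this page's title, and the snippet assumes the checkpoint ships a chat template (plausible given the UltraChat SFT base, but not stated on this card); adjust both as needed.

```python
# Minimal sketch: load the model and generate a response.
# Assumptions: repo ID matches the page title; a chat template is available.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "W-61/qwen3-8b-base-new-dpo-ultrafeedback-4xh200-batch-128-q_t-0.4-s_star-0.35-20260430-140517"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # assumption: bf16 inference; the card lists an FP8 quantization
    device_map="auto",
)

messages = [{"role": "user", "content": "Explain direct preference optimization in one paragraph."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output = model.generate(inputs, max_new_tokens=256)
# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True))
```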


Model Overview

This model, developed by W-61, builds on jackf857/qwen3-8b-base-sft-ultrachat-4xh200-batch-128, an SFT checkpoint of the 8 billion parameter Qwen3 base model, and is further trained with Direct Preference Optimization (DPO).

Key Capabilities

  • Preference Alignment: Trained on the HuggingFaceH4/ultrafeedback_binarized dataset to align model outputs with human preferences.
  • DPO Fine-tuning: Uses Direct Preference Optimization for improved response quality and fewer undesirable outputs; the standard objective is sketched below.
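
For reference, the standard DPO objective from Rafailov et al. (2023) is shown below; the card does not state whether a variant was used, so this is the presumed form:

$$
\mathcal{L}_{\text{DPO}}(\pi_\theta; \pi_{\text{ref}}) = -\,\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}} \left[ \log \sigma\!\left( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)} \right) \right]
$$

where $y_w$ and $y_l$ are the chosen and rejected responses for prompt $x$, $\pi_{\text{ref}}$ is the SFT reference policy, and $\beta$ controls the strength of the implicit KL constraint.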

Training Details

The model was trained for 1 epoch with a learning rate of 5e-07 and a total batch size of 128 across 4 GPUs, using the AdamW optimizer and a cosine learning-rate schedule with a 0.1 warmup ratio. Final evaluation loss was 0.6076, with DPO-specific metrics including a mean reward margin of 54.4214 and a mean chosen-response log-probability of -331.5330, consistent with effective preference learning.
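As a hedged illustration, the reported hyperparameters map onto a TRL DPOTrainer setup roughly as follows. The actual training stack is not stated on this card; the beta value and the per-device/accumulation split of the 128 total batch size are assumptions.

```python
# A minimal sketch, assuming the training used TRL's DPOTrainer.
# Values marked "reported" come from this card; everything else is an assumption.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

base_id = "jackf857/qwen3-8b-base-sft-ultrachat-4xh200-batch-128"
model = AutoModelForCausalLM.from_pretrained(base_id)
tokenizer = AutoTokenizer.from_pretrained(base_id)

dataset = load_dataset("HuggingFaceH4/ultrafeedback_binarized", split="train_prefs")

config = DPOConfig(
    output_dir="qwen3-8b-dpo-ultrafeedback",
    num_train_epochs=1,             # reported: 1 epoch
    learning_rate=5e-7,             # reported
    per_device_train_batch_size=8,  # assumption: 8 x 4 GPUs x 4 accumulation = 128 total
    gradient_accumulation_steps=4,  # assumption (only the total batch size of 128 is reported)
    optim="adamw_torch",            # reported: AdamW
    lr_scheduler_type="cosine",     # reported
    warmup_ratio=0.1,               # reported
    beta=0.1,                       # assumption: TRL default; not stated on the card
)

trainer = DPOTrainer(
    model=model,               # ref_model omitted: TRL builds a frozen reference copy
    args=config,
    train_dataset=dataset,
    processing_class=tokenizer,
)
trainer.train()
```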

Good For

  • Applications requiring models with improved alignment to human feedback.
  • Tasks where response quality and preference adherence are critical.