W-61/qwen3-8b-base-new-dpo-ultrafeedback-4xh200-batch-128-q_t-0.45-s_star-0.35-20260430-143919

Text Generation · Concurrency Cost: 1 · Model Size: 8B · Quant: FP8 · Ctx Length: 32k · Published: Apr 30, 2026 · Architecture: Transformer

W-61/qwen3-8b-base-new-dpo-ultrafeedback-4xh200-batch-128-q_t-0.45-s_star-0.35-20260430-143919 is an 8 billion parameter Qwen3-based language model fine-tuned using Direct Preference Optimization (DPO) on the HuggingFaceH4/ultrafeedback_binarized dataset. This model is optimized for generating responses aligned with human preferences, building upon a base model that was previously instruction-tuned. It is suitable for applications requiring high-quality, preference-aligned text generation.


Model Overview

This model is an 8 billion parameter language model built on the Qwen3 architecture. Starting from jackf857/qwen3-8b-base-sft-ultrachat-4xh200-batch-128, an instruction-tuned base model, it was further fine-tuned with Direct Preference Optimization (DPO) on the HuggingFaceH4/ultrafeedback_binarized dataset to improve its ability to generate human-preferred responses.
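
The snippet below is a minimal inference sketch, assuming the checkpoint is published on the Hugging Face Hub under the repository id above and ships with the standard Qwen3 chat template; the prompt and generation settings are illustrative.

```python
# Minimal inference sketch. Assumes the checkpoint lives on the Hugging Face
# Hub under the repository id shown in this card and includes a chat template.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "W-61/qwen3-8b-base-new-dpo-ultrafeedback-4xh200-batch-128-q_t-0.45-s_star-0.35-20260430-143919"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # FP8 weights are usually served via a dedicated runtime; bf16 is a safe default here
    device_map="auto",
)

messages = [{"role": "user", "content": "Explain Direct Preference Optimization in two sentences."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=256, do_sample=True, temperature=0.7)
# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```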

Key Training Details

  • Fine-tuning Method: Direct Preference Optimization (DPO)
  • Dataset: HuggingFaceH4/ultrafeedback_binarized
  • Base Model: jackf857/qwen3-8b-base-sft-ultrachat-4xh200-batch-128
  • Context Length: 32768 tokens
  • Hyperparameters: Training used a learning rate of 5e-07, a total batch size of 128 across 4 GPUs, and a cosine learning rate scheduler with a warmup ratio of 0.1 over 1 epoch (see the training sketch below).
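
The following is a hypothetical reconstruction of the training setup using TRL's DPOTrainer, based only on the hyperparameters listed above. The per-device batch size and gradient accumulation split (8 per device × 4 GPUs × 4 steps = 128) is an assumption; only the global batch size of 128 is stated in this card.

```python
# Sketch of the DPO fine-tuning setup with TRL. Values marked "stated" come
# from this card; everything else is an assumption for illustration.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

base_id = "jackf857/qwen3-8b-base-sft-ultrachat-4xh200-batch-128"
tokenizer = AutoTokenizer.from_pretrained(base_id)
model = AutoModelForCausalLM.from_pretrained(base_id)

train_dataset = load_dataset("HuggingFaceH4/ultrafeedback_binarized", split="train_prefs")

config = DPOConfig(
    output_dir="qwen3-8b-dpo-ultrafeedback",
    learning_rate=5e-7,              # stated
    lr_scheduler_type="cosine",      # stated
    warmup_ratio=0.1,                # stated
    num_train_epochs=1,              # stated
    per_device_train_batch_size=8,   # assumed split of the global batch of 128
    gradient_accumulation_steps=4,   # 8 * 4 GPUs * 4 steps = 128 global
    bf16=True,                       # assumed
)

trainer = DPOTrainer(
    model=model,
    args=config,
    train_dataset=train_dataset,
    processing_class=tokenizer,
)
trainer.train()
```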

Performance Metrics

During evaluation, the model achieved a validation loss of 0.5890. Key DPO-specific metrics include a `dpo/beta` of 0.0056 and a mean preference margin (`dpo/margin_mean`) of 51.3408, indicating effective preference alignment during training.
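
For reference, the logged margin is typically the gap between the implicit rewards of the chosen and rejected responses. The sketch below shows the standard DPO formulation with illustrative variable names; this card does not include the training code, so details such as whether the logged margin is beta-scaled are assumptions.

```python
# Standard DPO loss and reward margin, for context on the metrics above.
# Inputs are summed per-sequence log-probabilities under the policy and the
# frozen reference model; variable names are illustrative.
import torch.nn.functional as F

def dpo_loss_and_margin(policy_chosen_logps, policy_rejected_logps,
                        ref_chosen_logps, ref_rejected_logps, beta):
    # Implicit rewards: beta-scaled log-probability ratios vs. the reference model.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    margin = chosen_rewards - rejected_rewards   # commonly logged as a batch mean
    loss = -F.logsigmoid(margin).mean()          # standard DPO objective
    return loss, margin.mean()
```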

Intended Use Cases

This model is particularly well-suited for applications where generating text that aligns with human preferences is crucial. Its DPO fine-tuning makes it a strong candidate for tasks requiring nuanced, preference-aligned responses, such as advanced chatbots, content generation, and interactive AI systems.