W-61/qwen3-8b-base-new-dpo-ultrafeedback-4xh200-batch-128-q_t-0.45-s_star-0.6-20260430-165125

Text Generation · Concurrency Cost: 1 · Model Size: 8B · Quant: FP8 · Ctx Length: 32k · Published: Apr 30, 2026 · Architecture: Transformer

W-61/qwen3-8b-base-new-dpo-ultrafeedback-4xh200-batch-128-q_t-0.45-s_star-0.6-20260430-165125 is an 8-billion-parameter language model, fine-tuned from jackf857/qwen3-8b-base-sft-ultrachat-4xh200-batch-128. It was further optimized with Direct Preference Optimization (DPO) on the HuggingFaceH4/ultrafeedback_binarized dataset to improve alignment with human preferences. With a context length of 32768 tokens, it is designed for conversational AI and instruction-following tasks where human-like responses are critical.


Model Overview

W-61/qwen3-8b-base-new-dpo-ultrafeedback-4xh200-batch-128-q_t-0.45-s_star-0.6-20260430-165125 is an 8-billion-parameter language model, a fine-tuned iteration of jackf857/qwen3-8b-base-sft-ultrachat-4xh200-batch-128 optimized with Direct Preference Optimization (DPO).

Key Capabilities

  • Preference Alignment: Enhanced through DPO training on the HuggingFaceH4/ultrafeedback_binarized dataset, suggesting improved alignment with human preferences and instruction following.
  • Base Architecture: Built upon a Qwen3-8B base, providing a robust foundation for various natural language processing tasks.
  • Context Length: Supports a substantial context window of 32768 tokens, enabling processing of longer inputs and generating more coherent, extended responses.
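To make the "Preference Alignment" point concrete, here is a minimal sketch of the pairwise DPO objective the training stage optimizes: the policy is pushed to prefer the chosen response over the rejected one, relative to the frozen SFT reference model. The `beta` value is an illustrative assumption (a common DPO default), not a documented hyperparameter of this run.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Pairwise DPO loss: -log sigmoid(beta * (policy log-ratio - reference log-ratio)).

    Each argument is the summed log-probability of a full response under the
    policy or the frozen reference model. beta=0.1 is an assumed default.
    """
    policy_logratio = policy_chosen_logp - policy_rejected_logp
    ref_logratio = ref_chosen_logp - ref_rejected_logp
    logits = beta * (policy_logratio - ref_logratio)
    return -math.log(1.0 / (1.0 + math.exp(-logits)))

# When the policy and reference agree, the loss sits at log(2);
# it drops below log(2) once the policy favors the chosen response
# more strongly than the reference does.
```

In practice this loss is computed over batches of (chosen, rejected) pairs from ultrafeedback_binarized, with both models scoring each pair.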

Training Details

The model was trained for a single epoch with a learning rate of 5e-07 and a total batch size of 128 across 4 GPUs. The optimizer was AdamW (ADAMW_TORCH) with a cosine learning-rate scheduler and a warmup ratio of 0.1. This regimen aims to refine the model's conversational abilities and response quality.
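The stated schedule (cosine decay with a 0.1 warmup ratio, peaking at 5e-07) can be sketched as a small pure-Python function; this mirrors the shape of the common warmup-then-cosine schedule rather than reproducing the exact trainer internals.

```python
import math

def lr_at_step(step, total_steps, peak_lr=5e-7, warmup_ratio=0.1):
    """Linear warmup over the first warmup_ratio of steps, then cosine decay to 0."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Linear ramp from ~0 up to peak_lr.
        return peak_lr * (step + 1) / warmup_steps
    # Cosine decay from peak_lr toward 0 over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))
```

With a single epoch and a batch size of 128, `total_steps` is simply the number of preference pairs divided by 128.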

Good for

  • Conversational AI: Its DPO fine-tuning makes it suitable for chatbots and interactive agents that require nuanced, human-aligned responses.
  • Instruction Following: Expected to perform well in tasks where precise adherence to user instructions is crucial.
  • Applications requiring longer context: The 32K context window is beneficial for summarizing long documents, extended dialogue, or complex reasoning over large texts.
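For the long-context use case above, inputs still need to fit within the 32768-token window. A simple sketch of chunking an over-length document, using whitespace-split words as a rough stand-in for tokens (a real pipeline would count with the model's tokenizer):

```python
def chunk_for_context(text, max_tokens=32768, overlap=256):
    """Split a long document into overlapping word-based chunks that fit the
    32k context window. Words approximate tokens here; the overlap preserves
    some continuity between consecutive chunks."""
    words = text.split()
    chunks = []
    step = max_tokens - overlap
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_tokens]))
    return chunks
```

Each chunk can then be summarized independently, with the per-chunk summaries concatenated and summarized once more if needed.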