W-61/qwen3-8b-base-new-dpo-ultrafeedback-4xh200-batch-128-q_t-0.45-s_star-0.35-20260430-143919

Text Generation · Concurrency Cost: 1 · Model Size: 8B · Quant: FP8 · Ctx Length: 32k · Published: Apr 30, 2026 · Architecture: Transformer

W-61/qwen3-8b-base-new-dpo-ultrafeedback-4xh200-batch-128-q_t-0.45-s_star-0.35-20260430-143919 is an 8 billion parameter Qwen3-based language model fine-tuned using Direct Preference Optimization (DPO) on the HuggingFaceH4/ultrafeedback_binarized dataset. This model is optimized for generating responses aligned with human preferences, building upon a base model that was previously instruction-tuned. It is suitable for applications requiring high-quality, preference-aligned text generation.


Model Overview

This model is an 8 billion parameter language model built on the Qwen3 architecture. Starting from jackf857/qwen3-8b-base-sft-ultrachat-4xh200-batch-128, an instruction-tuned base model, it was further fine-tuned with Direct Preference Optimization (DPO) on the HuggingFaceH4/ultrafeedback_binarized dataset to improve its ability to generate human-preferred responses.
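
The snippet below is a minimal inference sketch, assuming the checkpoint is published on the Hugging Face Hub under the repository id above and ships with the standard Qwen3 chat template; the prompt and generation settings are illustrative.

```python
# Minimal inference sketch. Assumes the checkpoint lives on the Hugging Face
# Hub under the repository id shown in this card and includes a chat template.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "W-61/qwen3-8b-base-new-dpo-ultrafeedback-4xh200-batch-128-q_t-0.45-s_star-0.35-20260430-143919"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # FP8 weights are usually served via a dedicated runtime; bf16 is a safe default here
    device_map="auto",
)

messages = [{"role": "user", "content": "Explain Direct Preference Optimization in two sentences."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=256, do_sample=True, temperature=0.7)
# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```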

Key Training Details

  • Fine-tuning Method: Direct Preference Optimization (DPO)
  • Dataset: HuggingFaceH4/ultrafeedback_binarized
  • Base Model: jackf857/qwen3-8b-base-sft-ultrachat-4xh200-batch-128
  • Context Length: 32768 tokens
  • Hyperparameters: Training used a learning rate of 5e-07, a total batch size of 128 across 4 GPUs, and a cosine learning rate scheduler with a warmup ratio of 0.1 over 1 epoch (see the training sketch below).
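
The following is a hypothetical reconstruction of the training setup using TRL's DPOTrainer, based only on the hyperparameters listed above. The per-device batch size and gradient accumulation split (8 per device × 4 GPUs × 4 steps = 128) is an assumption; only the global batch size of 128 is stated in this card.

```python
# Sketch of the DPO fine-tuning setup with TRL. Values marked "stated" come
# from this card; everything else is an assumption for illustration.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

base_id = "jackf857/qwen3-8b-base-sft-ultrachat-4xh200-batch-128"
tokenizer = AutoTokenizer.from_pretrained(base_id)
model = AutoModelForCausalLM.from_pretrained(base_id)

train_dataset = load_dataset("HuggingFaceH4/ultrafeedback_binarized", split="train_prefs")

config = DPOConfig(
    output_dir="qwen3-8b-dpo-ultrafeedback",
    learning_rate=5e-7,              # stated
    lr_scheduler_type="cosine",      # stated
    warmup_ratio=0.1,                # stated
    num_train_epochs=1,              # stated
    per_device_train_batch_size=8,   # assumed split of the global batch of 128
    gradient_accumulation_steps=4,   # 8 * 4 GPUs * 4 steps = 128 global
    bf16=True,                       # assumed
)

trainer = DPOTrainer(
    model=model,
    args=config,
    train_dataset=train_dataset,
    processing_class=tokenizer,
)
trainer.train()
```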

Performance Metrics

During evaluation, the model achieved a validation loss of 0.5890. Key DPO-specific metrics include a `dpo/beta` of 0.0056 and a mean preference margin (`dpo/margin_mean`) of 51.3408, indicating effective preference alignment during training.
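
For reference, the logged margin is typically the gap between the implicit rewards of the chosen and rejected responses. The sketch below shows the standard DPO formulation with illustrative variable names; this card does not include the training code, so details such as whether the logged margin is beta-scaled are assumptions.

```python
# Standard DPO loss and reward margin, for context on the metrics above.
# Inputs are summed per-sequence log-probabilities under the policy and the
# frozen reference model; variable names are illustrative.
import torch.nn.functional as F

def dpo_loss_and_margin(policy_chosen_logps, policy_rejected_logps,
                        ref_chosen_logps, ref_rejected_logps, beta):
    # Implicit rewards: beta-scaled log-probability ratios vs. the reference model.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    margin = chosen_rewards - rejected_rewards   # commonly logged as a batch mean
    loss = -F.logsigmoid(margin).mean()          # standard DPO objective
    return loss, margin.mean()
```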

Intended Use Cases

This model is particularly well-suited for applications where generating text that aligns with human preferences is crucial. Its DPO fine-tuning makes it a strong candidate for tasks requiring nuanced, preference-aligned responses, such as advanced chatbots, content generation, and interactive AI systems.