W-61/qwen3-8b-base-new-dpo-ultrafeedback-4xh200-batch-128-q_t-0.45-s_star-0.45-20260430-143919
W-61/qwen3-8b-base-new-dpo-ultrafeedback-4xh200-batch-128-q_t-0.45-s_star-0.45-20260430-143919 is an 8-billion-parameter language model, fine-tuned from jackf857/qwen3-8b-base-sft-ultrachat-4xh200-batch-128 using Direct Preference Optimization (DPO) on the HuggingFaceH4/ultrafeedback_binarized dataset. The model targets improved alignment and preference following, reporting DPO evaluation metrics such as a margin mean of 44.1246 and chosen log-probabilities (logps) of -308.8182. It is suitable for applications requiring a robust 8B-parameter model with enhanced conversational quality and adherence to user preferences.
Model Overview
This model, W-61/qwen3-8b-base-new-dpo-ultrafeedback-4xh200-batch-128-q_t-0.45-s_star-0.45-20260430-143919, is an 8 billion parameter language model. It is a fine-tuned variant of jackf857/qwen3-8b-base-sft-ultrachat-4xh200-batch-128, specifically optimized using Direct Preference Optimization (DPO).
Key Capabilities
- Preference Alignment: Fine-tuned on the HuggingFaceH4/ultrafeedback_binarized dataset, indicating an emphasis on aligning model outputs with human preferences.
- DPO Optimization: Trained with DPO, which improves the quality and safety of generative models by optimizing the policy directly on preference pairs, without training a separate reward model.
- Performance Metrics: Achieved a validation loss of 0.5615, with notable DPO metrics including a margin mean of 44.1246 and chosen log-probabilities (logps) of -308.8182, suggesting effective preference learning. A sketch of the underlying loss follows this list.
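For context on how the margin metric arises, DPO defines an implicit reward as the beta-scaled log-ratio of the policy to a frozen reference model, and minimizes the negative log-sigmoid of the chosen-minus-rejected reward difference. Below is a minimal PyTorch sketch of that loss; the function name and the beta value are illustrative, not taken from this model's training code:

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Direct Preference Optimization loss (Rafailov et al., 2023).

    Each argument is a 1-D tensor of summed per-token log-probabilities
    for a batch of (chosen, rejected) response pairs. beta is illustrative.
    """
    # Implicit rewards: beta-scaled log-ratio of policy to reference model.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # The "margin" metric reported above is the mean of this difference.
    margins = chosen_rewards - rejected_rewards
    # Maximizing the log-sigmoid of the margin == minimizing this loss.
    loss = -F.logsigmoid(margins).mean()
    return loss, margins.mean()
```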
Training Details
The model was trained for 1 epoch with a learning rate of 5e-07, using the AdamW optimizer and a cosine learning rate scheduler. The effective training batch size was 128, obtained from 4 GPUs with 8 gradient accumulation steps (implying a per-device batch size of 4). A sketch of a comparable training configuration is shown below.
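The following is a minimal sketch of how a run with these hyperparameters could be set up using Hugging Face TRL. It is an assumption that TRL was the training framework; the per-device batch size of 4 is derived from 128 / (4 GPUs x 8 accumulation steps), and the exact TRL API may differ across versions:

```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

base = "jackf857/qwen3-8b-base-sft-ultrachat-4xh200-batch-128"
model = AutoModelForCausalLM.from_pretrained(base)
tokenizer = AutoTokenizer.from_pretrained(base)
dataset = load_dataset("HuggingFaceH4/ultrafeedback_binarized", split="train_prefs")

config = DPOConfig(
    output_dir="qwen3-8b-dpo-ultrafeedback",   # illustrative path
    num_train_epochs=1,
    learning_rate=5e-7,
    per_device_train_batch_size=4,             # 128 / (4 GPUs * 8 accum)
    gradient_accumulation_steps=8,
    lr_scheduler_type="cosine",
    optim="adamw_torch",
    bf16=True,                                  # assumed, not stated in the card
)

trainer = DPOTrainer(
    model=model,
    args=config,
    train_dataset=dataset,
    processing_class=tokenizer,
)
trainer.train()
```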
Intended Use Cases
This model is suitable for applications where conversational quality, adherence to user instructions, and preference alignment are critical. Its DPO fine-tuning makes it a strong candidate for chatbots, interactive AI assistants, and content generation tasks that benefit from human feedback integration.
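As a usage illustration, the model can be loaded for chat-style inference with transformers. This sketch assumes the checkpoint ships a chat template from its SFT stage; the prompt and sampling settings are illustrative:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "W-61/qwen3-8b-base-new-dpo-ultrafeedback-4xh200-batch-128-q_t-0.45-s_star-0.45-20260430-143919"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", torch_dtype="auto")

messages = [{"role": "user", "content": "Explain DPO in two sentences."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=256, do_sample=True, temperature=0.7)
# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```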