W-61/qwen3-8b-base-epsilon-dpo-ultrafeedback-4xh200-batch-128-20260422-131855
W-61/qwen3-8b-base-epsilon-dpo-ultrafeedback-4xh200-batch-128-20260422-131855 is an 8-billion-parameter language model fine-tuned by W-61 with Direct Preference Optimization (DPO) on the HuggingFaceH4/ultrafeedback_binarized dataset. It is a direct fine-tune of W-61/qwen3-8b-base-sft-ultrachat-4xh200-batch-128 and reaches a rewards/accuracies score of 0.7165, indicating improved preference alignment. The model is suited to conversational AI and instruction-following tasks where outputs should track human preferences.
Model Overview
This model, qwen3-8b-base-epsilon-dpo-ultrafeedback-4xh200-batch-128-20260422-131855, is an 8 billion parameter language model developed by W-61. It is a fine-tuned variant of the W-61/qwen3-8b-base-sft-ultrachat-4xh200-batch-128 base model, specifically optimized using Direct Preference Optimization (DPO).
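The model can be loaded with the Hugging Face transformers library. The snippet below is a minimal usage sketch, assuming a standard causal-LM checkpoint with a chat template; adjust dtype and device settings to your hardware.

```python
# Minimal usage sketch; assumes a standard transformers causal-LM checkpoint.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "W-61/qwen3-8b-base-epsilon-dpo-ultrafeedback-4xh200-batch-128-20260422-131855"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

# Build a chat-formatted prompt (assumes the tokenizer ships a chat template).
messages = [{"role": "user", "content": "Explain what DPO fine-tuning does."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```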
Key Capabilities
- Preference Alignment: Fine-tuned on the HuggingFaceH4/ultrafeedback_binarized dataset, indicating a focus on aligning model outputs with human preferences.
- Improved Reward Metrics: Achieved a rewards/accuracies score of 0.7165, with rewards/chosen at -0.1262 and rewards/rejected at -0.2486, suggesting better discrimination between preferred and rejected responses (see the sketch after this list).
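For context, these metrics derive from DPO's implicit reward: beta times the log-ratio of policy to reference probabilities for a response. rewards/accuracies is the fraction of pairs where the chosen response's implicit reward exceeds the rejected one's. The sketch below illustrates the computation; the log-probability values and beta are hypothetical, not taken from this run.

```python
# Illustrative sketch of how DPO-style reward metrics are computed.
# The log-probability values below are hypothetical, not from this training run.
import torch

beta = 0.1  # assumed DPO temperature; the actual value is not stated in this card

# Per-example log-prob sums under the policy and the frozen reference model.
policy_chosen_logps = torch.tensor([-52.1, -48.3, -60.7])
ref_chosen_logps = torch.tensor([-51.0, -47.9, -59.2])
policy_rejected_logps = torch.tensor([-55.4, -50.2, -58.9])
ref_rejected_logps = torch.tensor([-53.0, -49.5, -58.8])

# Implicit rewards: beta * log(pi_theta / pi_ref) for each response.
chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)

# rewards/accuracies: fraction of pairs where chosen outranks rejected.
accuracy = (chosen_rewards > rejected_rewards).float().mean()
print(chosen_rewards.mean().item(), rejected_rewards.mean().item(), accuracy.item())
```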
Training Details
The model was trained with a learning rate of 5e-07 and a total batch size of 128 (4 GPUs × a per-device batch of 4 × 8 gradient accumulation steps) for 1 epoch. The optimizer was ADAMW_TORCH with a cosine learning-rate scheduler and a warmup ratio of 0.1. Training finished with a validation loss of 0.6398.
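These hyperparameters map onto a trl DPOConfig roughly as sketched below. This is a reconstruction from the reported numbers, not the actual training script; the beta value, dataset split, and sequence handling are assumptions, and the exact DPOTrainer signature varies across trl versions.

```python
# Hedged reconstruction of the training setup from the reported hyperparameters.
# NOT the actual training script; beta and the dataset split are assumptions.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

base_id = "W-61/qwen3-8b-base-sft-ultrachat-4xh200-batch-128"
tokenizer = AutoTokenizer.from_pretrained(base_id)
model = AutoModelForCausalLM.from_pretrained(base_id)

dataset = load_dataset("HuggingFaceH4/ultrafeedback_binarized", split="train_prefs")

config = DPOConfig(
    output_dir="qwen3-8b-dpo-ultrafeedback",
    learning_rate=5e-7,
    per_device_train_batch_size=4,  # 4 GPUs x 4 per device x 8 grad-accum = 128 total
    gradient_accumulation_steps=8,
    num_train_epochs=1,
    optim="adamw_torch",
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    beta=0.1,  # assumed; the DPO beta is not reported in this card
)

trainer = DPOTrainer(
    model=model,
    args=config,
    train_dataset=dataset,
    processing_class=tokenizer,
)
trainer.train()
```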