W-61/llama-3-8b-base-new-dpo-ultrafeedback-4xh200-batch-128-q_t-0.45-s_star-0.35-20260428-045924

Text Generation · Concurrency Cost: 1 · Model Size: 8B · Quant: FP8 · Context Length: 8k · Published: Apr 28, 2026 · Architecture: Transformer

W-61/llama-3-8b-base-new-dpo-ultrafeedback-4xh200-batch-128-q_t-0.45-s_star-0.35-20260428-045924 is an 8-billion-parameter language model fine-tuned by W-61 using Direct Preference Optimization (DPO) on the HuggingFaceH4/ultrafeedback_binarized dataset. It is a DPO-tuned variant of W-61/llama-3-8b-base-sft-ultrachat-8xh200, trained to align its outputs with human preferences. The model supports an 8192-token context length and is optimized for generating responses that human evaluators prefer.
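
A minimal quick-start sketch using Hugging Face Transformers is shown below; the repo id comes from this card, while the prompt, precision, and sampling settings are illustrative assumptions.

```python
import torch
from transformers import pipeline

model_id = "W-61/llama-3-8b-base-new-dpo-ultrafeedback-4xh200-batch-128-q_t-0.45-s_star-0.35-20260428-045924"

# Standard text-generation pipeline; bfloat16 is an assumption
# (the FP8 quant listed above describes the hosted deployment).
generator = pipeline(
    "text-generation",
    model=model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Illustrative prompt and sampling settings, not from the card.
output = generator(
    "Explain the difference between supervised fine-tuning and DPO.",
    max_new_tokens=128,
    do_sample=True,
    temperature=0.7,
)
print(output[0]["generated_text"])
```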

Overview

Developed by W-61, this 8-billion-parameter language model was fine-tuned using Direct Preference Optimization (DPO). It is based on the W-61/llama-3-8b-base-sft-ultrachat-8xh200 model and was trained on the HuggingFaceH4/ultrafeedback_binarized dataset.

Key Training Details

  • Fine-tuning Method: Direct Preference Optimization (DPO)
  • Base Model: W-61/llama-3-8b-base-sft-ultrachat-8xh200
  • Dataset: HuggingFaceH4/ultrafeedback_binarized
  • Context Length: 8192 tokens
  • Training Hyperparameters (a reproduction sketch follows this list):
    • Learning Rate: 5e-07
    • Optimizer: ADAMW_TORCH (PyTorch AdamW)
    • Epochs: 1
    • Total Train Batch Size: 128
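
A hedged sketch of how this recipe could be reproduced with the TRL library is shown below. The base model, dataset, learning rate, optimizer, epoch count, and total batch size come from this card; the per-device batch size, gradient accumulation, beta, and precision are assumptions (only the total batch size of 128 on 4x H200 is reported), and `processing_class` is the argument name in recent TRL versions (older versions use `tokenizer=`).

```python
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

base_model = "W-61/llama-3-8b-base-sft-ultrachat-8xh200"

# UltraFeedback preference pairs (prompt / chosen / rejected).
dataset = load_dataset("HuggingFaceH4/ultrafeedback_binarized", split="train_prefs")

model = AutoModelForCausalLM.from_pretrained(base_model, torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(base_model)

config = DPOConfig(
    output_dir="llama-3-8b-dpo-ultrafeedback",
    learning_rate=5e-7,              # from the card
    num_train_epochs=1,              # from the card
    optim="adamw_torch",             # ADAMW_TORCH from the card
    per_device_train_batch_size=8,   # assumption: 8 x 4 GPUs x 4 accum = 128
    gradient_accumulation_steps=4,   # assumption; only the total (128) is reported
    max_length=8192,                 # matches the 8192-token context
    bf16=True,                       # assumption
)

# With ref_model unset, TRL uses a frozen copy of `model` as the DPO reference.
trainer = DPOTrainer(
    model=model,
    args=config,
    train_dataset=dataset,
    processing_class=tokenizer,
)
trainer.train()
```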

Performance Metrics

During training, the model achieved a final validation loss of 0.5985. Key DPO-specific metrics include a margin mean of 99.3391 and a KL mean of -301.0162; the margin reflects how strongly the policy separates chosen responses from rejected ones under the DPO objective.
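
For context on these numbers, the sketch below shows how DPO's implicit reward margin and loss are typically computed from summed token log-probabilities under the policy and a frozen reference model; the function name and the `beta` value are assumptions, as the card does not report them.

```python
import torch
import torch.nn.functional as F

def dpo_loss_and_margin(pi_chosen_logps: torch.Tensor,
                        pi_rejected_logps: torch.Tensor,
                        ref_chosen_logps: torch.Tensor,
                        ref_rejected_logps: torch.Tensor,
                        beta: float = 0.1):
    # Implicit DPO rewards: beta-scaled log-ratio of policy to reference.
    chosen_rewards = beta * (pi_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (pi_rejected_logps - ref_rejected_logps)

    # Margin: how far chosen responses are preferred over rejected ones;
    # its batch mean corresponds to a "margin mean" style metric.
    margins = chosen_rewards - rejected_rewards

    # DPO objective: maximize the log-sigmoid of the margin.
    loss = -F.logsigmoid(margins).mean()
    return loss, margins.mean()
```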

Intended Use Cases

The README does not detail specific intended uses or limitations. In general, models fine-tuned with DPO on preference datasets are well suited to tasks that require high-quality, human-aligned text generation, such as instruction following, dialogue, and content creation where user preference is a key factor.
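
As an illustration of dialogue-style use, the sketch below formats a conversation with the tokenizer's chat template before generating. This assumes the tokenizer ships a chat template (plausible given the UltraChat SFT stage, but not confirmed by this card); the message and sampling settings are illustrative.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "W-61/llama-3-8b-base-new-dpo-ultrafeedback-4xh200-batch-128-q_t-0.45-s_star-0.35-20260428-045924"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# Illustrative single-turn dialogue.
messages = [
    {"role": "user", "content": "Draft a polite follow-up email after a job interview."}
]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=256, do_sample=True, temperature=0.7)
# Decode only the newly generated tokens.
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```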