W-61/llama-3-8b-base-beta-dpo-ultrafeedback-4xh200-batch-128-20260424-044124
W-61/llama-3-8b-base-beta-dpo-ultrafeedback-4xh200-batch-128-20260424-044124 is an 8-billion-parameter Llama 3 base model fine-tuned with Direct Preference Optimization (DPO) on the HuggingFaceH4/ultrafeedback_binarized dataset to align its responses with human preferences. It is intended for applications requiring high-quality, preference-aligned text generation.
Overview
This model, llama-3-8b-base-beta-dpo-ultrafeedback-4xh200-batch-128-20260424-044124, is an 8-billion-parameter Llama 3 base model that has undergone Direct Preference Optimization (DPO). It was fine-tuned from W-61/llama-3-8b-base-sft-ultrachat-8xh200 on the HuggingFaceH4/ultrafeedback_binarized dataset.
Key Characteristics
- Base Model: Llama 3, 8 billion parameters.
- Fine-tuning Method: Direct Preference Optimization (DPO).
- Training Data: HuggingFaceH4/ultrafeedback_binarized, a binarized preference dataset used to align model outputs with human preferences.
- Context Length: 8192 tokens.
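DPO optimizes the policy directly against pairwise preference data, without a separate reward model. As a rough illustration (not code from this repository; the helper name and the beta value are hypothetical), the per-example loss penalizes the policy when its log-likelihood ratio over the reference model favors the rejected response:

```python
import math

def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """Per-example DPO loss: -log sigmoid(beta * (chosen - rejected log-ratios))."""
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    # Numerically stable -log(sigmoid(margin)), i.e. softplus(-margin).
    if margin > 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

At initialization the policy equals the reference model, so both log-ratios are zero and the loss starts near ln(2) ≈ 0.693; training drives the margin positive.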
Training Details
The model was trained for 1 epoch with a learning rate of 5e-07, an effective batch size of 128 (across 4 GPUs), and a cosine learning rate scheduler with a 0.1 warmup ratio. Reported evaluation metrics include a final validation loss of 0.6357 and a mean beta-DPO gap of 28.0227; a positive gap indicates the policy assigns a higher relative likelihood to chosen responses than to rejected ones.
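The schedule above can be sketched as follows (a minimal approximation, not the actual training code; step counts are illustrative): the learning rate rises linearly for the first 10% of steps, then decays along a cosine curve from the 5e-07 peak.

```python
import math

def lr_at_step(step: int, total_steps: int,
               peak_lr: float = 5e-7, warmup_ratio: float = 0.1) -> float:
    """Linear warmup to peak_lr, then cosine decay toward 0."""
    warmup_steps = max(1, int(total_steps * warmup_ratio))
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))
```

With this shape, the peak learning rate is reached exactly at the end of warmup and the final step lands at (approximately) zero.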
Potential Use Cases
Given its DPO fine-tuning on a human-preference dataset, this model is likely suitable for:
- Generating responses that are preferred by humans.
- Applications requiring high-quality, aligned text outputs.
- Tasks where nuanced understanding of human preferences is beneficial.