W-61/llama-3-8b-base-new-dpo-ultrafeedback-4xh200-batch-128-s_star-0.4-20260425-111846

TEXT GENERATION · Concurrency Cost: 1 · Model Size: 8B · Quant: FP8 · Ctx Length: 8k · Published: Apr 27, 2026 · Architecture: Transformer

W-61/llama-3-8b-base-new-dpo-ultrafeedback-4xh200-batch-128-s_star-0.4-20260425-111846 is an 8-billion-parameter Llama 3 base model fine-tuned by W-61 using Direct Preference Optimization (DPO) on the HuggingFaceH4/ultrafeedback_binarized dataset. The fine-tuning steers the model toward responses that human annotators preferred, improving alignment and response quality. It is suitable for applications that require nuanced, human-aligned text generation.
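
A minimal usage sketch with Hugging Face Transformers follows. The repository id is taken from the card's title; the dtype, sampling settings, and prompt are illustrative assumptions, since the card ships no official example.

```python
# Minimal generation sketch; repo id from the card title, all settings assumed.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "W-61/llama-3-8b-base-new-dpo-ultrafeedback-4xh200-batch-128-s_star-0.4-20260425-111846"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # assumption: bf16 fits an 8B model on one modern GPU
    device_map="auto",
)

prompt = "Explain Direct Preference Optimization in two sentences."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128, do_sample=True, temperature=0.7)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```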

Overview

This model, developed by W-61, is an 8 billion parameter Llama 3 base model that has undergone fine-tuning using Direct Preference Optimization (DPO). The training utilized the HuggingFaceH4/ultrafeedback_binarized dataset, aiming to align the model's outputs more closely with human preferences.

Key Capabilities

  • Preference-aligned Generation: Fine-tuned with DPO to favor the chosen response over the rejected one in each ultrafeedback preference pair.
  • Llama 3 Architecture: Benefits from the foundational capabilities of the Llama 3 8B base model.
  • Optimized for Quality: Reported training metrics include a validation loss of 0.5784 and a DPO margin mean of 88.5980, indicating effective preference learning (see the loss sketch after this list).
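
For context, the margin metric above falls out of the DPO objective, which rewards the policy for ranking the chosen completion above the rejected one relative to a frozen reference model. The sketch below is a schematic rendering of that loss (following Rafailov et al., 2023), not the card's actual training code; the beta value is an assumption.

```python
# Schematic DPO loss; beta and variable names are illustrative, not from the card.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    # Implicit rewards: how much the policy prefers each completion over the reference.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    margins = chosen_rewards - rejected_rewards       # the "margin" metric reported above
    loss = -F.logsigmoid(margins).mean()              # push chosen above rejected
    return loss, margins
```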

Training Details

The model was trained for 1 epoch with a learning rate of 5e-07 and a total batch size of 128 across 4 GPUs. The optimizer was AdamW with a cosine learning-rate scheduler and a warmup ratio of 0.1. Training used Transformers 4.51.0, PyTorch 2.3.1+cu121, Datasets 2.21.0, and Tokenizers 0.21.4.
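
These hyperparameters map naturally onto TRL's DPOTrainer, though the card does not name the training framework; the sketch below is a hypothetical reconstruction, and the per-device batch split is one of several ways to reach the stated total of 128 on 4 GPUs.

```python
# Hypothetical reconstruction with TRL's DPOTrainer; the framework choice and batch
# split are assumptions, while the hyperparameter values come from the card.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

model = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")
dataset = load_dataset("HuggingFaceH4/ultrafeedback_binarized", split="train_prefs")

config = DPOConfig(
    output_dir="llama-3-8b-dpo-ultrafeedback",
    num_train_epochs=1,
    learning_rate=5e-7,               # AdamW is the Trainer default optimizer
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    per_device_train_batch_size=8,    # 8 x 4 GPUs x grad accum 4 = 128 total (assumed split)
    gradient_accumulation_steps=4,
)

trainer = DPOTrainer(model=model, args=config, train_dataset=dataset, processing_class=tokenizer)
trainer.train()
```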

Good For

  • Applications requiring text generation that is aligned with human feedback.
  • Tasks where response quality and preference are critical.
  • Further fine-tuning or research into DPO-based model alignment.