Name: jackf857/llama-3-8b-base-new-dpo-ultrafeedback-4xh200-batch-128-q_t-0.4-s_star-0.5 API
Brand: Featherless.ai
Price: 10.00 USD
Availability: InStock
Author: jackf857

Model Overview

jackf857/llama-3-8b-base-new-dpo-ultrafeedback-4xh200-batch-128-q_t-0.4-s_star-0.5 is an 8-billion-parameter language model derived from W-61/llama-3-8b-base-sft-ultrachat-8xh200. It was fine-tuned with Direct Preference Optimization (DPO) on the HuggingFaceH4/ultrafeedback_binarized dataset so that its responses align more closely with human preferences.
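For readers unfamiliar with DPO, the following is a minimal sketch of the standard per-example DPO objective, not the author's training code. The function name, the example log-probabilities, and the `beta` value are illustrative assumptions; the model card does not state `beta`, nor whether a DPO variant was used.

```python
import math


def dpo_loss_and_margin(policy_chosen_logp, policy_rejected_logp,
                        ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one preference pair (hypothetical helper).

    Inputs are summed sequence log-probabilities under the policy being
    trained and under the frozen reference model. beta=0.1 is an assumed
    value, not taken from the model card.
    """
    # Implicit rewards: scaled log-ratio of policy to reference.
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)

    # Margin: how much more the policy favors the chosen response.
    margin = chosen_reward - rejected_reward

    # loss = -log(sigmoid(margin)) = softplus(-margin), computed stably.
    loss = max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
    return loss, margin


# Made-up log-probabilities, purely for illustration.
loss, margin = dpo_loss_and_margin(-10.0, -15.0, -12.0, -13.0)
print(f"loss={loss:.4f} margin={margin:.4f}")
```

A larger positive margin means the policy prefers the chosen response more strongly than the reference model does, and the loss falls toward zero as the margin grows.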

Training Details

The model was trained with DPO for a single epoch at a learning rate of 5e-07. Training ran on 4 GPUs with 8 gradient accumulation steps, for a total effective batch size of 128. Reported training metrics include a final loss of 0.5929 and a mean DPO margin of 96.9446; the margin reflects how strongly the model separates preferred from rejected responses.
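The per-device batch size is not stated, but it can be inferred from the figures above, assuming the usual relation: effective batch size = per-device batch size × number of GPUs × gradient accumulation steps. The helper name below is hypothetical, and the result is a derived value rather than one reported on the model card.

```python
def per_device_batch_size(effective_batch_size, num_gpus, grad_accum_steps):
    """Infer the per-device batch size from the effective batch size."""
    samples_per_device_step = num_gpus * grad_accum_steps
    if effective_batch_size % samples_per_device_step != 0:
        raise ValueError("effective batch size is not evenly divisible")
    return effective_batch_size // samples_per_device_step


# Figures from the training details: 128 total, 4 GPUs, 8 accumulation steps.
print(per_device_batch_size(128, 4, 8))  # prints 4
```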

Potential Use Cases

Because it was fine-tuned with DPO on a preference dataset, this model is likely well suited to applications where:

Preference-aligned response generation is critical.
Instruction following and helpful, harmless, and honest outputs are desired.
Refining outputs based on implicit human feedback is beneficial.

Limitations

The model card states that more information is needed about the model's intended uses, limitations, and training and evaluation data. Users should evaluate the model on their own applications before relying on it.
