tsavage68/chat_1000STEPS_1e6_05beta_DPO
tsavage68/chat_1000STEPS_1e6_05beta_DPO is a 7 billion parameter language model, fine-tuned from Meta's Llama-2-7b-chat-hf base model with Direct Preference Optimization (DPO). It reports a reward accuracy of 53.19% on its evaluation set, meaning it ranks the preferred response above the rejected one slightly more often than chance. Built on the Llama 2 chat architecture, it is intended for chat-based applications where preference alignment matters.
Model Overview
The tsavage68/chat_1000STEPS_1e6_05beta_DPO is a 7 billion parameter language model derived from the meta-llama/Llama-2-7b-chat-hf base. It has undergone fine-tuning using a Direct Preference Optimization (DPO) method over 1000 training steps, aiming to align its outputs with human preferences.
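For reference, DPO optimizes the policy directly against pairwise preference data instead of training a separate reward model. A standard formulation of the objective (Rafailov et al., 2023) is shown below; the β = 0.5 value is inferred from the "05beta" suffix in the model name and is an assumption, not a documented training setting.

$$
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta;\pi_{\mathrm{ref}}) = -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}}\left[\log\sigma\!\left(\beta\log\frac{\pi_\theta(y_w\mid x)}{\pi_{\mathrm{ref}}(y_w\mid x)} \;-\; \beta\log\frac{\pi_\theta(y_l\mid x)}{\pi_{\mathrm{ref}}(y_l\mid x)}\right)\right]
$$

Here y_w is the preferred (chosen) response, y_l the rejected one, and π_ref is the frozen reference model (the Llama-2-7b-chat base). The "chosen" and "rejected" rewards reported in the metrics below are the β-scaled log-probability ratios appearing inside this objective.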
Training Highlights
- Base Model: Meta Llama-2-7b-chat-hf
- Optimization Method: Direct Preference Optimization (DPO)
- Key Metrics: Achieved a reward accuracy of 53.19% on the evaluation set; the mean implicit reward was -0.5484 for chosen responses and -0.8442 for rejected responses, giving a reward margin of 0.2958 (chosen minus rejected).
- Hyperparameters: Training used a learning rate of 1e-06, a per-device batch size of 4 (effective batch size of 8 with gradient accumulation), the Adam optimizer, and a cosine learning-rate schedule over the 1000 training steps (see the configuration sketch below).
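The original training script is not published with the model, so the following is only a minimal sketch of how a comparable run could be configured with the trl library's DPOTrainer. The dataset name, output directory, beta value, and exact argument names (which vary across trl versions) are assumptions.

```python
# Minimal sketch of a comparable DPO run using trl (assumes a recent trl release
# where DPOConfig carries the beta parameter). Dataset name, output path, and
# beta = 0.5 are assumptions, not documented settings of this model.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

base = "meta-llama/Llama-2-7b-chat-hf"
model = AutoModelForCausalLM.from_pretrained(base)
ref_model = AutoModelForCausalLM.from_pretrained(base)  # frozen reference policy
tokenizer = AutoTokenizer.from_pretrained(base)

# Preference dataset with "prompt", "chosen", and "rejected" columns (hypothetical name).
train_dataset = load_dataset("your-org/your-preference-dataset", split="train")

config = DPOConfig(
    output_dir="chat_1000STEPS_1e6_05beta_DPO",
    beta=0.5,                          # inferred from the "05beta" model-name suffix
    learning_rate=1e-6,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=2,     # effective batch size of 8
    max_steps=1000,
    lr_scheduler_type="cosine",
    logging_steps=50,
)

trainer = DPOTrainer(
    model=model,
    ref_model=ref_model,
    args=config,
    train_dataset=train_dataset,
    tokenizer=tokenizer,
)
trainer.train()
```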
Potential Use Cases
This model is likely suitable for applications requiring a preference-aligned chat experience, building on the conversational capabilities of the Llama 2 base. Its DPO training suggests an emphasis on generating responses that are preferred over alternatives, making it potentially useful for interactive agents or dialogue systems where response quality and alignment are important.
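As a rough usage sketch, the model can be loaded with the standard transformers chat workflow. This assumes the repository ships the usual Llama-2 chat template with its tokenizer; the prompt and generation settings below are illustrative only.

```python
# Hedged inference sketch: load the model and generate a chat response.
# Assumes the tokenizer provides a Llama-2 chat template.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "tsavage68/chat_1000STEPS_1e6_05beta_DPO"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

messages = [{"role": "user", "content": "Explain what DPO fine-tuning does in one paragraph."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=256, do_sample=True, temperature=0.7)
# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```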