tsavage68/chat_1000STEPS_1e6_03beta_DPO

Text Generation · Model Size: 7B · Quant: FP8 · Ctx Length: 4k · Architecture: Transformer · Concurrency Cost: 1 · Published: Feb 15, 2024

tsavage68/chat_1000STEPS_1e6_03beta_DPO is a 7-billion-parameter language model fine-tuned from Meta's Llama-2-7b-chat-hf. It was trained with Direct Preference Optimization (DPO) for 1000 steps, reaching a rewards/accuracies score of 0.5363 on its evaluation set. The model targets chat applications, combining its Llama-2 foundation with DPO training to align responses with human preferences.


Overview

tsavage68/chat_1000STEPS_1e6_03beta_DPO is a 7 billion parameter language model, fine-tuned from the meta-llama/Llama-2-7b-chat-hf base model. It was developed by tsavage68 and trained using Direct Preference Optimization (DPO) over 1000 steps, with a learning rate of 1e-06 and a total batch size of 8. The model's training aimed to align its responses with human preferences, as indicated by its DPO-specific evaluation metrics.

Key Training Details

  • Base Model: meta-llama/Llama-2-7b-chat-hf
  • Training Method: Direct Preference Optimization (DPO)
  • Training Steps: 1000
  • Learning Rate: 1e-06
  • Optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
  • Evaluation Metrics: Achieved rewards/accuracies of 0.5363, rewards/margins of 0.2144, and a final loss of 0.6804 on the evaluation set.
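DPO optimizes the policy directly on preference pairs: for each prompt it pushes the log-probability ratio (policy vs. a frozen reference model) of the chosen response above that of the rejected one, scaled by a temperature beta. A minimal per-example sketch of the loss is below; the beta value of 0.3 is an assumption inferred from the "03beta" suffix in the model name, not a documented hyperparameter.

```python
import math


def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.3) -> float:
    """Per-example DPO loss: -log sigmoid(beta * (chosen log-ratio - rejected log-ratio)).

    beta=0.3 is assumed from the model name's "03beta" suffix.
    """
    chosen_ratio = policy_chosen_logp - ref_chosen_logp
    rejected_ratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_ratio - rejected_ratio)
    return -math.log(1.0 / (1.0 + math.exp(-margin)))


# At initialization the policy equals the reference, so every margin is 0
# and the loss starts at log 2 ~= 0.6931; training drives it down from there.
print(round(dpo_loss(-10.0, -12.0, -10.0, -12.0), 4))  # -> 0.6931
```

This makes the reported final eval loss of 0.6804 easy to read: it sits just below the log 2 starting point, consistent with the modest rewards/margins of 0.2144.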

Intended Use Cases

This model is best suited for chat applications where preference alignment matters. Its DPO training optimizes it to generate responses that are preferred over alternatives, making it a candidate for conversational AI, dialogue systems, and interactive agents. Developers can rely on its Llama-2 foundation for general language understanding and generation, with the DPO fine-tuning layered on top to improve response quality according to preference data.
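Because the model builds on Llama-2-7b-chat-hf, prompts should follow the Llama-2 chat format. A minimal single-turn sketch is below; in practice, confirm the exact template against the tokenizer's built-in chat template rather than hand-rolling it.

```python
def build_llama2_chat_prompt(user_message: str, system_prompt: str = "") -> str:
    """Wrap a single-turn message in the Llama-2 chat format ([INST] ... [/INST]).

    This is a simplified sketch of the standard Llama-2 chat template; prefer
    tokenizer.apply_chat_template() from transformers for multi-turn use.
    """
    if system_prompt:
        return f"[INST] <<SYS>>\n{system_prompt}\n<</SYS>>\n\n{user_message} [/INST]"
    return f"[INST] {user_message} [/INST]"


prompt = build_llama2_chat_prompt("What is DPO?", "You are a helpful assistant.")
print(prompt)
```

The resulting string can be passed directly to a text-generation pipeline loaded with the model ID tsavage68/chat_1000STEPS_1e6_03beta_DPO.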