tsavage68/chat_1000STEPS_1e6_03beta_DPO

Text Generation · Model Size: 7B · Quant: FP8 · Ctx Length: 4k · Architecture: Transformer · Concurrency Cost: 1 · Published: Feb 15, 2024

tsavage68/chat_1000STEPS_1e6_03beta_DPO is a 7-billion-parameter language model fine-tuned from Meta's Llama-2-7b-chat-hf. It was trained with Direct Preference Optimization (DPO) for 1000 steps, reaching a rewards/accuracies score of 0.5363 on its evaluation set. The model targets chat applications, combining its Llama-2 foundation with DPO training to align responses with human preferences.


Overview

tsavage68/chat_1000STEPS_1e6_03beta_DPO is a 7 billion parameter language model, fine-tuned from the meta-llama/Llama-2-7b-chat-hf base model. It was developed by tsavage68 and trained using Direct Preference Optimization (DPO) over 1000 steps, with a learning rate of 1e-06 and a total batch size of 8. The model's training aimed to align its responses with human preferences, as indicated by its DPO-specific evaluation metrics.

Key Training Details

  • Base Model: meta-llama/Llama-2-7b-chat-hf
  • Training Method: Direct Preference Optimization (DPO)
  • Training Steps: 1000
  • Learning Rate: 1e-06
  • Optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
  • Evaluation Metrics: Achieved rewards/accuracies of 0.5363, rewards/margins of 0.2144, and a final loss of 0.6804 on the evaluation set.
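DPO optimizes the policy directly on preference pairs: for each prompt it pushes the log-probability ratio (policy vs. a frozen reference model) of the chosen response above that of the rejected one, scaled by a temperature beta. A minimal per-example sketch of the loss is below; the beta value of 0.3 is an assumption inferred from the "03beta" suffix in the model name, not a documented hyperparameter.

```python
import math


def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.3) -> float:
    """Per-example DPO loss: -log sigmoid(beta * (chosen log-ratio - rejected log-ratio)).

    beta=0.3 is assumed from the model name's "03beta" suffix.
    """
    chosen_ratio = policy_chosen_logp - ref_chosen_logp
    rejected_ratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_ratio - rejected_ratio)
    return -math.log(1.0 / (1.0 + math.exp(-margin)))


# At initialization the policy equals the reference, so every margin is 0
# and the loss starts at log 2 ~= 0.6931; training drives it down from there.
print(round(dpo_loss(-10.0, -12.0, -10.0, -12.0), 4))  # -> 0.6931
```

This makes the reported final eval loss of 0.6804 easy to read: it sits just below the log 2 starting point, consistent with the modest rewards/margins of 0.2144.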

Intended Use Cases

This model is best suited for chat applications where preference alignment matters. Its DPO training optimizes it to generate responses that are preferred over alternatives, making it a candidate for conversational AI, dialogue systems, and interactive agents. Developers can rely on its Llama-2 foundation for general language understanding and generation, with the DPO fine-tuning layered on top to improve response quality according to preference data.
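Because the model builds on Llama-2-7b-chat-hf, prompts should follow the Llama-2 chat format. A minimal single-turn sketch is below; in practice, confirm the exact template against the tokenizer's built-in chat template rather than hand-rolling it.

```python
def build_llama2_chat_prompt(user_message: str, system_prompt: str = "") -> str:
    """Wrap a single-turn message in the Llama-2 chat format ([INST] ... [/INST]).

    This is a simplified sketch of the standard Llama-2 chat template; prefer
    tokenizer.apply_chat_template() from transformers for multi-turn use.
    """
    if system_prompt:
        return f"[INST] <<SYS>>\n{system_prompt}\n<</SYS>>\n\n{user_message} [/INST]"
    return f"[INST] {user_message} [/INST]"


prompt = build_llama2_chat_prompt("What is DPO?", "You are a helpful assistant.")
print(prompt)
```

The resulting string can be passed directly to a text-generation pipeline loaded with the model ID tsavage68/chat_1000STEPS_1e6_03beta_DPO.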