tsavage68/chat_1000STEPS_1e6rate_01beta_DPO

Text Generation · Concurrency Cost: 1 · Model Size: 7B · Quant: FP8 · Ctx Length: 4k · Published: Feb 14, 2024 · Architecture: Transformer

The tsavage68/chat_1000STEPS_1e6rate_01beta_DPO model is a 7-billion-parameter language model fine-tuned from meta-llama/Llama-2-7b-chat-hf using Direct Preference Optimization (DPO) for 1000 steps at a learning rate of 1e-06. Its reported reward metrics from DPO training indicate how well it distinguishes preferred from dispreferred responses.


Model Overview

tsavage68/chat_1000STEPS_1e6rate_01beta_DPO is a 7-billion-parameter language model derived from the meta-llama/Llama-2-7b-chat-hf base model and fine-tuned with Direct Preference Optimization (DPO) over 1000 training steps. The repository name appears to encode the training configuration: 1000 steps, a 1e-06 learning rate, and likely a DPO beta of 0.1.

Training Details

This model was trained with a learning rate of 1e-06, a batch size of 4 with gradient accumulation to an effective batch size of 8, and the Adam optimizer. Training ran for 1000 steps and produced the following evaluation metrics:

  • Loss: 0.6684
  • Rewards/chosen: -0.3437
  • Rewards/rejected: -0.4414
  • Rewards/accuracies: 0.5055
  • Rewards/margins: 0.0978

These metrics reflect how reliably the model distinguishes preferred ('chosen') from dispreferred ('rejected') responses during DPO training. An accuracy of 0.5055, barely above chance, and a small margin of 0.0978 suggest the preference signal learned in 1000 steps is modest.
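The reported quantities follow directly from the DPO objective: each response's "reward" is beta times the policy-vs-reference log-probability ratio, the margin is the chosen-minus-rejected reward gap, and accuracy is the fraction of pairs where the chosen reward wins. A minimal sketch in plain Python (assuming beta = 0.1, per the "01beta" in the model name; the log-probability values in the usage example below are illustrative, not from the model card):

```python
import math

def dpo_metrics(policy_logps_chosen, policy_logps_rejected,
                ref_logps_chosen, ref_logps_rejected, beta=0.1):
    """Compute DPO loss and reward metrics for a batch of preference pairs.

    Each argument is a list of summed log-probabilities (one per pair).
    beta=0.1 is an assumption inferred from the "01beta" in the model name.
    """
    rc, rr, losses = [], [], []
    for pc, pr, qc, qr in zip(policy_logps_chosen, policy_logps_rejected,
                              ref_logps_chosen, ref_logps_rejected):
        reward_chosen = beta * (pc - qc)       # rewards/chosen
        reward_rejected = beta * (pr - qr)     # rewards/rejected
        margin = reward_chosen - reward_rejected
        # DPO loss: -log sigmoid(reward margin)
        losses.append(-math.log(1.0 / (1.0 + math.exp(-margin))))
        rc.append(reward_chosen)
        rr.append(reward_rejected)
    n = len(losses)
    return {
        "loss": sum(losses) / n,
        "rewards/chosen": sum(rc) / n,
        "rewards/rejected": sum(rr) / n,
        "rewards/margins": sum(c - r for c, r in zip(rc, rr)) / n,
        "rewards/accuracies": sum(c > r for c, r in zip(rc, rr)) / n,
    }

# Illustrative log-probabilities for two preference pairs
metrics = dpo_metrics([-10.0, -12.0], [-11.0, -11.5],
                      [-10.5, -12.0], [-10.5, -11.0])
```

Note that when margins hover near zero the per-pair loss sits near log 2 ≈ 0.693, which is consistent with the reported loss of 0.6684 and the near-chance accuracy above.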

Intended Uses & Limitations

As a DPO fine-tune of Llama-2-7b-chat-hf, the model is most likely intended for conversational AI applications where preference alignment matters. However, the model card does not explicitly state intended uses, limitations, or details of the training dataset.
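Since the base model is meta-llama/Llama-2-7b-chat-hf, it is reasonable to assume this fine-tune expects the standard Llama-2 chat prompt template; the model card itself does not document a prompt format, so treat this as an assumption. A minimal single-turn prompt builder:

```python
def llama2_chat_prompt(user_message, system_prompt=None):
    """Build a single-turn prompt in the Llama-2 chat format.

    Assumption: as a fine-tune of meta-llama/Llama-2-7b-chat-hf, the model
    expects the same [INST] ... [/INST] template (with an optional <<SYS>>
    block). The model card does not confirm this.
    """
    if system_prompt:
        return (f"[INST] <<SYS>>\n{system_prompt}\n<</SYS>>\n\n"
                f"{user_message} [/INST]")
    return f"[INST] {user_message} [/INST]"

prompt = llama2_chat_prompt("Summarize DPO in one sentence.",
                            system_prompt="You are a concise assistant.")
```

If the model is served through an API that applies a chat template automatically (e.g. the Hugging Face `apply_chat_template` mechanism), manual formatting like this is unnecessary.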