tsavage68/chat_1000STEPS_1e6rate_01beta_DPO
The tsavage68/chat_1000STEPS_1e6rate_01beta_DPO model is a 7-billion-parameter language model fine-tuned from meta-llama/Llama-2-7b-chat-hf. It was trained with Direct Preference Optimization (DPO) for 1000 steps at a learning rate of 1e-06, and its reported reward metrics summarize how well it learned to rank preferred responses above dispreferred ones.
Model Overview
The tsavage68/chat_1000STEPS_1e6rate_01beta_DPO is a 7-billion-parameter language model derived from the meta-llama/Llama-2-7b-chat-hf base model. It was fine-tuned with Direct Preference Optimization (DPO) for 1000 training steps.
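The card provides no usage snippet, so the following is a minimal inference sketch, assuming the model is available on the Hugging Face Hub under this identifier and retains the chat template of its Llama-2-7b-chat-hf base (the prompt text is illustrative):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "tsavage68/chat_1000STEPS_1e6rate_01beta_DPO"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

# Format a single-turn conversation with the tokenizer's chat template,
# assuming the fine-tune kept the template from Llama-2-7b-chat-hf.
messages = [{"role": "user", "content": "Explain DPO fine-tuning in one paragraph."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output_ids = model.generate(input_ids, max_new_tokens=256, do_sample=True, temperature=0.7)
# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```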
Training Details
The model was trained with a learning rate of 1e-06, a train batch size of 4 (gradient-accumulated to an effective batch size of 8), and the Adam optimizer. Training ran for 1000 steps, producing the following evaluation metrics:
- Loss: 0.6684
- Rewards/chosen: -0.3437
- Rewards/rejected: -0.4414
- Rewards/accuracies: 0.5055
- Rewards/margins: 0.0978
These metrics reflect how well the model distinguishes preferred ('chosen') from dispreferred ('rejected') responses: chosen responses receive a higher implicit reward on average (a margin of 0.0978), though the accuracy of 0.5055 is only marginally above the 0.5 chance level for pairwise comparisons. The sketch below shows how these quantities are derived.
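The reported rewards are the implicit rewards of the DPO objective rather than outputs of a separate reward model. Below is a minimal sketch of how they are computed from policy and frozen-reference log-probabilities, assuming the standard DPO loss and β = 0.1 (inferred from the "01beta" suffix in the model name; the function and tensor names are illustrative):

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             reference_chosen_logps, reference_rejected_logps,
             beta=0.1):  # beta assumed from the "01beta" model-name suffix
    """DPO loss and implicit rewards for a batch of preference pairs.

    Each argument is a 1-D tensor of summed per-sequence log-probabilities.
    """
    # Implicit rewards: beta-scaled log-ratios against the frozen reference model.
    chosen_rewards = beta * (policy_chosen_logps - reference_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - reference_rejected_logps)

    # "rewards/margins" is the mean gap between chosen and rejected rewards.
    margins = chosen_rewards - rejected_rewards
    loss = -F.logsigmoid(margins).mean()

    # "rewards/accuracies" is the fraction of pairs where chosen outranks rejected.
    accuracy = (chosen_rewards > rejected_rewards).float().mean()
    return loss, chosen_rewards.mean(), rejected_rewards.mean(), margins.mean(), accuracy
```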
Intended Uses & Limitations
As a DPO fine-tune of Llama-2-7b-chat-hf, the model is likely intended for conversational AI applications where preference alignment is important. However, the model card does not explicitly state intended uses, limitations, or the dataset used for DPO training.