W-61/llama-3-8b-base-epsilon-dpo-ultrafeedback-8xh200
W-61/llama-3-8b-base-epsilon-dpo-ultrafeedback-8xh200 is an 8 billion parameter language model fine-tuned by W-61, based on the Llama 3 architecture. This model is a DPO (Direct Preference Optimization) fine-tune of a Llama 3 base model, specifically optimized using the HuggingFaceH4/ultrafeedback_binarized dataset. It is designed to align with human preferences, achieving a rewards accuracy of 0.6905 on the evaluation set, making it suitable for tasks requiring high-quality, preference-aligned text generation.
Overview
This model, W-61/llama-3-8b-base-epsilon-dpo-ultrafeedback-8xh200, is an 8 billion parameter language model developed by W-61. It is a fine-tuned variant of the W-61/llama-3-8b-base-sft-ultrachat-8xh200 model, specifically optimized using Direct Preference Optimization (DPO) on the HuggingFaceH4/ultrafeedback_binarized dataset. The DPO training aims to align the model's outputs more closely with human preferences.
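To make the DPO objective concrete: it minimizes a loss over preference pairs that pushes the policy's log-probability of the chosen response up, relative to a frozen reference (here, the SFT model), and the rejected response down. The following is a minimal, illustrative sketch; the log-probabilities and the `beta` value are made up for the example, not taken from this training run.

```python
import math

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss for one preference pair, given the summed log-probabilities
    of the chosen and rejected responses under the policy and under the
    frozen reference (SFT) model. beta is illustrative, not this run's value."""
    logits = beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))
    # Equivalent to -log(sigmoid(logits)), computed stably.
    return math.log1p(math.exp(-logits))

# When the policy prefers the chosen response more than the reference does,
# the loss falls below log(2) ≈ 0.693 (the value at initialization).
print(round(dpo_loss(-10.0, -14.0, -11.0, -13.0), 4))  # → 0.5981
```

At the start of training, when policy and reference agree, the loss is exactly log(2) ≈ 0.693, which is why the reported eval loss of 0.6085 indicates modest but real preference learning.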
Key Characteristics
- Architecture: Llama 3 base model, fine-tuned.
- Parameter Count: 8 billion parameters.
- Context Length: 8192 tokens.
- Optimization Method: Direct Preference Optimization (DPO).
- Training Data: Fine-tuned on the HuggingFaceH4/ultrafeedback_binarized dataset.
Performance Metrics
On the evaluation set, the model achieved the following results:
- Loss: 0.6085
- Rewards/accuracies: 0.6905 (indicating a 69.05% accuracy in aligning with preferred responses)
- Rewards/margins: 0.2488
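These metrics derive from DPO's implicit reward, beta times the log-ratio of policy to reference probability for a response: rewards/accuracies is the fraction of eval pairs where the chosen response's implicit reward exceeds the rejected one's, and rewards/margins is the mean gap between them. A sketch with hypothetical log-probabilities (the values and `beta` below are invented for illustration):

```python
def implicit_reward(policy_logp, ref_logp, beta=0.1):
    # DPO's implicit reward: beta * log(pi(y|x) / pi_ref(y|x))
    return beta * (policy_logp - ref_logp)

def reward_stats(pairs, beta=0.1):
    """pairs: (pi_chosen, ref_chosen, pi_rejected, ref_rejected) log-probs.
    Returns (accuracy, mean margin) as logged during DPO evaluation."""
    margins, correct = [], 0
    for pc, rc, pr, rr in pairs:
        r_chosen = implicit_reward(pc, rc, beta)
        r_rejected = implicit_reward(pr, rr, beta)
        margins.append(r_chosen - r_rejected)
        correct += r_chosen > r_rejected
    return correct / len(pairs), sum(margins) / len(margins)

# Three hypothetical preference pairs; the third is mis-ranked.
pairs = [(-10.0, -11.0, -14.0, -13.0),
         (-12.0, -12.5, -12.0, -12.2),
         (-9.0, -9.1, -8.0, -8.5)]
acc, margin = reward_stats(pairs)  # acc ≈ 0.67, margin ≈ 0.063
```

The reported 0.6905 accuracy and 0.2488 margin mean that, on roughly 69% of held-out pairs, the model assigns a higher implicit reward to the human-preferred response.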
Training Details
The model was trained with a learning rate of 5e-07, a per-device batch size of 4 (effective batch size of 128 across 8 GPUs with gradient accumulation), and a cosine learning rate scheduler with a 0.1 warmup ratio over 1 epoch. The training process used the AdamW optimizer.
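The schedule above can be sketched as follows. This is a minimal approximation of a cosine scheduler with linear warmup; the exact step accounting in the actual trainer's implementation may differ slightly at the boundaries.

```python
import math

def lr_at(step, total_steps, peak_lr=5e-7, warmup_ratio=0.1):
    """Linear warmup over the first warmup_ratio of steps, then cosine
    decay toward zero over the remainder."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Ramp linearly from peak_lr/warmup_steps up to peak_lr.
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

# The peak of 5e-7 is reached right at the end of warmup,
# then the rate decays smoothly for the rest of the single epoch.
```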
Intended Use Cases
Given its DPO fine-tuning on a preference dataset, this model is well-suited for applications where generating high-quality, human-preferred responses is critical. This includes tasks such as:
- Instruction following
- Dialogue systems
- Content generation requiring nuanced understanding of preferences