vukien2301/llama-3.1-8b-ultrafeedback-dpo-from-epoch1
vukien2301/llama-3.1-8b-ultrafeedback-dpo-from-epoch1 is an 8-billion-parameter language model fine-tuned with Direct Preference Optimization (DPO) on the pvdhihihi/ultra-feedback dataset. The card describes it as based on a Llama 3.2 architecture, trained for one epoch, and supporting a context length of 32768 tokens. It targets tasks where alignment with human preferences matters.
Model Overview
vukien2301/llama-3.1-8b-ultrafeedback-dpo-from-epoch1 is an 8-billion-parameter language model fine-tuned with Direct Preference Optimization (DPO). It is built on a Llama 3.2 base architecture, and its DPO training uses the pvdhihihi/ultra-feedback dataset.
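For readers unfamiliar with DPO, the per-example objective can be sketched in a few lines. This is a minimal illustration of the standard DPO loss, not the training code used for this model. The function names and the value beta=0.1 are assumptions; the card does not state which beta was used.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # assumed value; not stated in the model card
) -> float:
    """Standard DPO loss for one (chosen, rejected) response pair.

    Each argument is the summed log-probability of a full response under
    either the policy being trained or the frozen reference model.
    """
    # Implicit rewards: how far the policy has moved from the reference.
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    # The loss shrinks as the chosen response is favored over the rejected one.
    return -log_sigmoid(chosen_reward - rejected_reward)
```

When the policy equals the reference, both implicit rewards are zero and the loss is log 2. It falls below that as the policy assigns relatively more probability to the chosen response.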
Key Training Details
- Base Model: Derived from /home/minchan.kwon/ADPA/model/llama3.2-1b-deita-dpomix/ref_teacher_3epochs/checkpoint-191.
- Fine-tuning Method: Direct Preference Optimization (DPO).
- Dataset: pvdhihihi/ultra-feedback.
- Epochs: Trained for 1 epoch.
- Learning Rate: 7e-07.
- Batch Size: A train_batch_size of 32 and eval_batch_size of 8, with a total_train_batch_size of 256 across 8 GPUs.
- Optimizer: AdamW with default betas and epsilon.
- Context Length: Supports a context length of 32768 tokens.
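The batch-size figures in the list above can be cross-checked with a little arithmetic. The sketch below assumes that train_batch_size is a per-device value and that the total is the product of per-device size, device count, and gradient-accumulation steps, as is conventional in auto-generated training summaries. The helper names are hypothetical, and the card does not list the accumulation steps directly.

```python
def total_train_batch_size(
    per_device_batch_size: int, num_devices: int, grad_accum_steps: int = 1
) -> int:
    """Effective number of examples consumed per optimizer step."""
    return per_device_batch_size * num_devices * grad_accum_steps


def implied_grad_accum_steps(
    total_batch_size: int, per_device_batch_size: int, num_devices: int
) -> int:
    """Gradient-accumulation steps implied by the reported totals."""
    steps, remainder = divmod(total_batch_size, per_device_batch_size * num_devices)
    if remainder:
        raise ValueError("reported totals are not consistent with each other")
    return steps


# Values reported in the card: 32 per device, 8 GPUs, 256 in total.
print(total_train_batch_size(32, 8))         # 256
print(implied_grad_accum_steps(256, 32, 8))  # 1
```

Under that assumption the reported numbers are self-consistent: 32 × 8 = 256, which implies a single gradient-accumulation step.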
Intended Use
This model is intended primarily for applications where alignment with human preferences, as learned through DPO on feedback data, is important. The DPO fine-tuning suggests it suits tasks that call for nuanced responses and adherence to a preferred conversational style or level of content quality.
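A minimal sketch of trying the model for text generation follows. It assumes the repository is hosted on the Hugging Face Hub under the ID given in the title and loads with the standard transformers auto classes; neither is confirmed by the card. The generate helper and its default of 128 new tokens are illustrative choices.

```python
MODEL_ID = "vukien2301/llama-3.1-8b-ultrafeedback-dpo-from-epoch1"


def generate(prompt: str, max_new_tokens: int = 128) -> str:
    """Generate a continuation for `prompt` with the fine-tuned model."""
    # Imported lazily so the module can be imported without transformers installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    # Assumption: the repository is public and loadable with the auto classes.
    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID)

    inputs = tokenizer(prompt, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

Calling generate("Explain preference-based fine-tuning in one sentence.") downloads the weights on first use, so it needs network access and enough memory for the checkpoint.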