W-61/llama-3-8b-base-new-dpo-ultrafeedback-4xh200-batch-128-q_t-0.45-s_star-0.3-20260428-045924
W-61/llama-3-8b-base-new-dpo-ultrafeedback-4xh200-batch-128-q_t-0.45-s_star-0.3-20260428-045924 is an 8-billion-parameter language model fine-tuned from W-61/llama-3-8b-base-sft-ultrachat-8xh200 with Direct Preference Optimization (DPO) on the HuggingFaceH4/ultrafeedback_binarized dataset. The DPO stage aligns the model with human preferences, making it suitable for conversational AI and instruction-following tasks.
Model Overview
This model is a fine-tuned variant of W-61/llama-3-8b-base-sft-ultrachat-8xh200, further optimized with Direct Preference Optimization (DPO) so that its responses better reflect the human preference judgments collected in UltraFeedback.
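For quick experimentation, the model can be loaded with the standard transformers API. The following is a minimal generation sketch, assuming the repository ships a tokenizer whose chat template was inherited from the SFT base model; the prompt content is illustrative.

```python
# Minimal generation sketch (assumes the tokenizer provides a chat template).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "W-61/llama-3-8b-base-new-dpo-ultrafeedback-4xh200-batch-128-q_t-0.45-s_star-0.3-20260428-045924"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # bf16 keeps the 8B model within a single modern GPU
    device_map="auto",
)

messages = [{"role": "user", "content": "Explain Direct Preference Optimization in two sentences."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output = model.generate(input_ids, max_new_tokens=256, do_sample=True, temperature=0.7)
# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```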
Key Training Details
- Base Model: Fine-tuned from W-61/llama-3-8b-base-sft-ultrachat-8xh200.
- Fine-tuning Method: Direct Preference Optimization (DPO).
- Dataset: Trained on the HuggingFaceH4/ultrafeedback_binarized dataset.
- Training Hyperparameters (a configuration sketch follows this list):
- Learning Rate: 5e-07
- Optimizer: AdamW with betas=(0.9, 0.999) and epsilon=1e-08
- Batch Size: 4 per device (train) and 2 per device (eval); with 8 gradient accumulation steps across 4 GPUs, the effective batch size is 128 for training (4 × 8 × 4) and 8 for evaluation (2 × 4).
- Epochs: 1
- Evaluation Performance: Reached a validation loss of 0.6247 on the evaluation set, alongside the standard DPO reward and margin metrics used to gauge preference alignment.
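For reference, the hyperparameters above map naturally onto trl's DPOTrainer. The sketch below is a reconstruction under that assumption, not the authors' actual training script: values not listed on this card (such as the DPO beta, warmup schedule, and the q_t/s_star parameters in the model name) are left at library defaults or omitted.

```python
# Hedged reconstruction of the training setup with trl's DPOTrainer.
# Anything not stated on the model card stays at trl defaults.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

base_id = "W-61/llama-3-8b-base-sft-ultrachat-8xh200"
tokenizer = AutoTokenizer.from_pretrained(base_id)
model = AutoModelForCausalLM.from_pretrained(base_id)

# UltraFeedback binarized preference pairs (chosen vs. rejected responses).
train_dataset = load_dataset("HuggingFaceH4/ultrafeedback_binarized", split="train_prefs")
eval_dataset = load_dataset("HuggingFaceH4/ultrafeedback_binarized", split="test_prefs")

config = DPOConfig(
    output_dir="llama-3-8b-dpo-ultrafeedback",
    learning_rate=5e-7,             # AdamW with betas=(0.9, 0.999), eps=1e-8 is the default
    per_device_train_batch_size=4,  # x 8 accumulation steps x 4 GPUs = 128 effective
    per_device_eval_batch_size=2,   # x 4 GPUs = 8 effective
    gradient_accumulation_steps=8,
    num_train_epochs=1,
)

trainer = DPOTrainer(
    model=model,
    args=config,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    processing_class=tokenizer,  # named `tokenizer` in older trl releases
)
trainer.train()
```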
Intended Use Cases
This model is primarily intended for applications requiring strong instruction following and human preference alignment, such as:
- Conversational AI: Generating coherent and contextually relevant responses in dialogue systems.
- Instruction Following: Executing complex instructions and producing desired outputs based on user prompts.
- Preference-aligned Generation: Creating content that is preferred by human evaluators, leveraging its DPO fine-tuning.
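As an illustration of the conversational use case, a multi-turn exchange can be driven through the transformers text-generation pipeline, which accepts chat-style message lists in recent releases; the prompts here are purely illustrative.

```python
# Hypothetical multi-turn dialogue sketch using the text-generation pipeline.
from transformers import pipeline

chatbot = pipeline(
    "text-generation",
    model="W-61/llama-3-8b-base-new-dpo-ultrafeedback-4xh200-batch-128-q_t-0.45-s_star-0.3-20260428-045924",
    device_map="auto",
)

conversation = [{"role": "user", "content": "Suggest a weekend itinerary for Kyoto."}]
# The pipeline returns the full message list with the assistant's reply appended.
reply = chatbot(conversation, max_new_tokens=256)[0]["generated_text"][-1]
print(reply["content"])

# Carry the reply forward to continue the dialogue.
conversation.append(reply)
conversation.append({"role": "user", "content": "Make day two food-focused."})
print(chatbot(conversation, max_new_tokens=256)[0]["generated_text"][-1]["content"])
```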