vukien2301/llama-3.1-8b-ultrafeedback-dpo-from-epoch1

Task: Text Generation | Concurrency Cost: 1 | Model Size: 8B | Quant: FP8 | Ctx Length: 32k | Published: May 9, 2026 | Architecture: Transformer | Warm

vukien2301/llama-3.1-8b-ultrafeedback-dpo-from-epoch1 is an 8-billion-parameter language model fine-tuned with Direct Preference Optimization (DPO) on the pvdhihihi/ultra-feedback dataset. It is based on a Llama 3.2 architecture and was trained for one epoch with a context length of 32,768 tokens. It targets tasks where responses should align with human preferences.


Model Overview

vukien2301/llama-3.1-8b-ultrafeedback-dpo-from-epoch1 is an 8-billion-parameter language model fine-tuned with Direct Preference Optimization (DPO). It is built on a Llama 3.2 base architecture, and its DPO training uses the pvdhihihi/ultra-feedback dataset.
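To make the DPO objective concrete, here is a minimal sketch of the standard per-example DPO loss in plain Python. It is not this model's training code. The function name `dpo_loss` and the value `beta=0.1` are assumptions for illustration; the card does not report the beta that was used.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-example DPO loss from sequence log-probabilities.

    beta=0.1 is an assumed value; the model card does not state it.
    """
    # Implicit rewards: how far the policy has moved from the reference model.
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    margin = chosen_reward - rejected_reward

    # loss = -log(sigmoid(margin)) = log(1 + exp(-margin)),
    # written in a numerically stable form for both signs of the margin.
    if margin > 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


if __name__ == "__main__":
    # Policy identical to the reference: margin is 0, loss is log(2).
    print(dpo_loss(-10.0, -12.0, -10.0, -12.0))
    # Policy raises the chosen response relative to the reference: loss drops.
    print(dpo_loss(-8.0, -12.0, -10.0, -12.0))
```

The loss falls as the policy assigns relatively more probability to the preferred response than the reference model does, which is how preference alignment is learned without a separate reward model.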

Key Training Details

  • Base Model: Derived from /home/minchan.kwon/ADPA/model/llama3.2-1b-deita-dpomix/ref_teacher_3epochs/checkpoint-191.
  • Fine-tuning Method: Direct Preference Optimization (DPO).
  • Dataset: pvdhihihi/ultra-feedback.
  • Epochs: Trained for 1 epoch.
  • Learning Rate: 7e-07.
  • Batch Size: train_batch_size of 32 and eval_batch_size of 8, giving a total_train_batch_size of 256 across 8 GPUs.
  • Optimizer: AdamW with default betas and epsilon.
  • Context Length: Supports a context length of 32768 tokens.
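The batch-size figures above can be checked with a short calculation. This sketch assumes train_batch_size is a per-device value, as is conventional in Trainer-generated cards, and that the total is the product of per-device size, device count, and gradient accumulation steps. The accumulation step count is not reported in the card; the helper names are hypothetical.

```python
def effective_batch_size(per_device, num_devices, grad_accum_steps=1):
    """Total examples consumed per optimizer step."""
    return per_device * num_devices * grad_accum_steps


def implied_grad_accum(total, per_device, num_devices):
    """Gradient accumulation steps implied by the reported totals."""
    steps, remainder = divmod(total, per_device * num_devices)
    if remainder:
        raise ValueError("total is not a multiple of per_device * num_devices")
    return steps


if __name__ == "__main__":
    # Values reported in the card: 32 per device, 8 GPUs, 256 total.
    print(effective_batch_size(32, 8))
    print(implied_grad_accum(256, 32, 8))
```

Under these assumptions, 32 × 8 already equals 256, so the reported numbers are consistent with no gradient accumulation (one step).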

Intended Use

This model is intended primarily for applications that need outputs aligned with human preferences, as learned through DPO on a feedback dataset. It is suited to tasks that call for nuanced responses and adherence to a preferred conversational style or level of content quality.