W-61/llama-3-8b-base-new-dpo-ultrafeedback-4xh200-batch-128-s_star-0.4-20260425-111846

TEXT GENERATION · Concurrency Cost: 1 · Model Size: 8B · Quant: FP8 · Ctx Length: 8k · Published: Apr 27, 2026 · Architecture: Transformer

W-61/llama-3-8b-base-new-dpo-ultrafeedback-4xh200-batch-128-s_star-0.4-20260425-111846 is an 8-billion-parameter Llama 3 base model fine-tuned by W-61 using Direct Preference Optimization (DPO) on the HuggingFaceH4/ultrafeedback_binarized dataset. The fine-tuning steers the model toward responses that human annotators preferred, improving alignment and response quality. It is suitable for applications that require nuanced, human-aligned text generation.
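
A minimal usage sketch with Hugging Face Transformers follows. The repository id is taken from the card's title; the dtype, sampling settings, and prompt are illustrative assumptions, since the card ships no official example.

```python
# Minimal generation sketch; repo id from the card title, all settings assumed.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "W-61/llama-3-8b-base-new-dpo-ultrafeedback-4xh200-batch-128-s_star-0.4-20260425-111846"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # assumption: bf16 fits an 8B model on one modern GPU
    device_map="auto",
)

prompt = "Explain Direct Preference Optimization in two sentences."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128, do_sample=True, temperature=0.7)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```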

Overview

This model, developed by W-61, is an 8 billion parameter Llama 3 base model that has undergone fine-tuning using Direct Preference Optimization (DPO). The training utilized the HuggingFaceH4/ultrafeedback_binarized dataset, aiming to align the model's outputs more closely with human preferences.

Key Capabilities

  • Preference-aligned Generation: Fine-tuned with DPO to favor the chosen response over the rejected one in each ultrafeedback preference pair.
  • Llama 3 Architecture: Benefits from the foundational capabilities of the Llama 3 8B base model.
  • Optimized for Quality: Reported training metrics include a validation loss of 0.5784 and a DPO margin mean of 88.5980, indicating effective preference learning (see the loss sketch after this list).
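
For context, the margin metric above falls out of the DPO objective, which rewards the policy for ranking the chosen completion above the rejected one relative to a frozen reference model. The sketch below is a schematic rendering of that loss (following Rafailov et al., 2023), not the card's actual training code; the beta value is an assumption.

```python
# Schematic DPO loss; beta and variable names are illustrative, not from the card.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    # Implicit rewards: how much the policy prefers each completion over the reference.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    margins = chosen_rewards - rejected_rewards       # the "margin" metric reported above
    loss = -F.logsigmoid(margins).mean()              # push chosen above rejected
    return loss, margins
```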

Training Details

The model was trained for 1 epoch with a learning rate of 5e-07 and a total batch size of 128 across 4 GPUs. The optimizer was AdamW with a cosine learning-rate scheduler and a warmup ratio of 0.1. Training used Transformers 4.51.0, PyTorch 2.3.1+cu121, Datasets 2.21.0, and Tokenizers 0.21.4.
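
These hyperparameters map naturally onto TRL's DPOTrainer, though the card does not name the training framework; the sketch below is a hypothetical reconstruction, and the per-device batch split is one of several ways to reach the stated total of 128 on 4 GPUs.

```python
# Hypothetical reconstruction with TRL's DPOTrainer; the framework choice and batch
# split are assumptions, while the hyperparameter values come from the card.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

model = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")
dataset = load_dataset("HuggingFaceH4/ultrafeedback_binarized", split="train_prefs")

config = DPOConfig(
    output_dir="llama-3-8b-dpo-ultrafeedback",
    num_train_epochs=1,
    learning_rate=5e-7,               # AdamW is the Trainer default optimizer
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    per_device_train_batch_size=8,    # 8 x 4 GPUs x grad accum 4 = 128 total (assumed split)
    gradient_accumulation_steps=4,
)

trainer = DPOTrainer(model=model, args=config, train_dataset=dataset, processing_class=tokenizer)
trainer.train()
```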

Good For

  • Applications requiring text generation that is aligned with human feedback.
  • Tasks where response quality and preference are critical.
  • Further fine-tuning or research into DPO-based model alignment.