W-61/llama-3-8b-base-epsilon-dpo-ultrafeedback-8xh200

Text Generation · Concurrency Cost: 1 · Model Size: 8B · Quant: FP8 · Ctx Length: 8k · Published: Apr 11, 2026 · Architecture: Transformer

W-61/llama-3-8b-base-epsilon-dpo-ultrafeedback-8xh200 is an 8 billion parameter language model fine-tuned by W-61 from a Llama 3 base model using DPO (Direct Preference Optimization) on the HuggingFaceH4/ultrafeedback_binarized dataset. The training aligns the model's outputs with human preferences; it reaches a rewards/accuracies score of 0.6905 on the evaluation set, making it suitable for tasks that require high-quality, preference-aligned text generation.

Overview

This model is a fine-tuned variant of W-61/llama-3-8b-base-sft-ultrachat-8xh200, further optimized with Direct Preference Optimization (DPO) on the HuggingFaceH4/ultrafeedback_binarized dataset so that its outputs align more closely with human preferences.
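
The checkpoint can be loaded with the standard Hugging Face transformers API. A minimal sketch, assuming the repository id above is reachable on the Hub (the card lists FP8 quantization for serving; this loads whatever weights the repo ships):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "W-61/llama-3-8b-base-epsilon-dpo-ultrafeedback-8xh200"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",  # pick up the dtype recorded in the checkpoint config
    device_map="auto",   # requires the accelerate package
)
```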

Key Characteristics

  • Architecture: Llama 3 (transformer), fine-tuned from an SFT checkpoint.
  • Parameter Count: 8 billion parameters.
  • Context Length: 8192 tokens.
  • Optimization Method: Direct Preference Optimization (DPO).
  • Training Data: Fine-tuned on the HuggingFaceH4/ultrafeedback_binarized dataset.

Performance Metrics

On the evaluation set, the model achieved the following results:

  • Loss: 0.6085
  • Rewards/accuracies: 0.6905 (the chosen response received the higher implicit reward on 69.05% of evaluation pairs; see the sketch after this list)
  • Rewards/margins: 0.2488
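
For context: in DPO, the implicit "reward" of a response is the β-scaled log-probability ratio between the policy and the frozen reference model. Rewards/accuracies is the fraction of preference pairs where the chosen response receives the higher implicit reward, and rewards/margins is the mean gap between the two. A minimal PyTorch sketch of these definitions (β is assumed; the card does not state it):

```python
import torch
import torch.nn.functional as F

beta = 0.1  # assumed DPO temperature; not stated on this card

def dpo_metrics(policy_chosen_logps, policy_rejected_logps,
                ref_chosen_logps, ref_rejected_logps):
    # Implicit rewards: beta-scaled log-ratios of policy vs. reference.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)

    # DPO loss: -log sigmoid(chosen_reward - rejected_reward)
    margins = chosen_rewards - rejected_rewards
    loss = -F.logsigmoid(margins).mean()

    # rewards/accuracies: fraction of pairs where the chosen response wins;
    # rewards/margins: mean reward gap between chosen and rejected.
    accuracy = (chosen_rewards > rejected_rewards).float().mean()
    return loss, accuracy, margins.mean()

# Toy example with made-up per-sequence log-probabilities for two pairs.
loss, acc, margin = dpo_metrics(
    torch.tensor([-10.0, -12.0]), torch.tensor([-14.0, -11.0]),
    torch.tensor([-11.0, -12.5]), torch.tensor([-13.0, -12.0]),
)
print(f"loss={loss:.4f} rewards/accuracies={acc:.4f} rewards/margins={margin:.4f}")
```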

Training Details

The model was trained for 1 epoch using the AdamW optimizer with a learning rate of 5e-07, a per-device batch size of 4 (total effective batch size of 128 across the 8 GPUs, which implies gradient accumulation), and a cosine learning rate scheduler with a 0.1 warmup ratio.
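
These hyperparameters map directly onto TRL's DPOTrainer. The sketch below is a hypothetical reconstruction rather than the published training script: gradient_accumulation_steps=4 is inferred from the effective batch size (4 per device × 8 GPUs × 4 accumulation steps = 128), the DPO β and bf16 settings are assumptions, and it targets a recent TRL release where β is a DPOConfig field.

```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

# Start from the SFT checkpoint named on this card.
sft_id = "W-61/llama-3-8b-base-sft-ultrachat-8xh200"
model = AutoModelForCausalLM.from_pretrained(sft_id)
tokenizer = AutoTokenizer.from_pretrained(sft_id)

# Preference pairs from the dataset named on the card.
train_dataset = load_dataset("HuggingFaceH4/ultrafeedback_binarized",
                             split="train_prefs")

args = DPOConfig(
    output_dir="llama-3-8b-epsilon-dpo",  # placeholder path
    learning_rate=5e-7,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,  # inferred: 4 x 8 GPUs x 4 steps = 128
    num_train_epochs=1,
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    optim="adamw_torch",
    beta=0.1,   # assumed; the card does not state beta
    bf16=True,  # assumed for H200 hardware
)

# With no ref_model given, DPOTrainer clones the model as the frozen reference.
trainer = DPOTrainer(model=model, args=args,
                     train_dataset=train_dataset, processing_class=tokenizer)
trainer.train()
```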

Intended Use Cases

Given its DPO fine-tuning on a preference dataset, this model is well-suited for applications where generating high-quality, human-preferred responses is critical. This includes tasks such as:

  • Instruction following
  • Dialogue systems
  • Content generation requiring nuanced understanding of preferences
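
For dialogue-style use, a generation sketch continuing from the loading example in the Overview section. It assumes the checkpoint ships a chat template (plausible given its ultrachat SFT lineage, but not confirmed by this card), and the sampling parameters are illustrative:

```python
# Continues from the loading sketch above (tokenizer and model in scope).
messages = [{"role": "user", "content": "Explain what DPO fine-tuning does."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output_ids = model.generate(
    input_ids,
    max_new_tokens=256,
    do_sample=True,   # illustrative sampling settings
    temperature=0.7,
    top_p=0.9,
)
# Strip the prompt tokens before decoding.
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:],
                       skip_special_tokens=True))
```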