W-61/llama-3-8b-base-new-dpo-ultrafeedback-4xh200-batch-128-q_t-0.45-s_star-0.35-20260428-045924

Text Generation · Concurrency Cost: 1 · Model Size: 8B · Quant: FP8 · Context Length: 8k · Published: Apr 28, 2026 · Architecture: Transformer

W-61/llama-3-8b-base-new-dpo-ultrafeedback-4xh200-batch-128-q_t-0.45-s_star-0.35-20260428-045924 is an 8-billion-parameter language model fine-tuned by W-61 using Direct Preference Optimization (DPO) on the HuggingFaceH4/ultrafeedback_binarized dataset. It is a DPO-tuned variant of W-61/llama-3-8b-base-sft-ultrachat-8xh200, trained to align its outputs with human preferences. The model supports an 8192-token context length and is optimized for generating responses that human evaluators prefer.
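
A minimal quick-start sketch using Hugging Face Transformers is shown below; the repo id comes from this card, while the prompt, precision, and sampling settings are illustrative assumptions.

```python
import torch
from transformers import pipeline

model_id = "W-61/llama-3-8b-base-new-dpo-ultrafeedback-4xh200-batch-128-q_t-0.45-s_star-0.35-20260428-045924"

# Standard text-generation pipeline; bfloat16 is an assumption
# (the FP8 quant listed above describes the hosted deployment).
generator = pipeline(
    "text-generation",
    model=model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Illustrative prompt and sampling settings, not from the card.
output = generator(
    "Explain the difference between supervised fine-tuning and DPO.",
    max_new_tokens=128,
    do_sample=True,
    temperature=0.7,
)
print(output[0]["generated_text"])
```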

Overview

Developed by W-61, this 8-billion-parameter language model was fine-tuned using Direct Preference Optimization (DPO). It is based on the W-61/llama-3-8b-base-sft-ultrachat-8xh200 model and was trained on the HuggingFaceH4/ultrafeedback_binarized dataset.

Key Training Details

  • Fine-tuning Method: Direct Preference Optimization (DPO)
  • Base Model: W-61/llama-3-8b-base-sft-ultrachat-8xh200
  • Dataset: HuggingFaceH4/ultrafeedback_binarized
  • Context Length: 8192 tokens
  • Training Hyperparameters (a reproduction sketch follows this list):
    • Learning Rate: 5e-07
    • Optimizer: ADAMW_TORCH (PyTorch AdamW)
    • Epochs: 1
    • Total Train Batch Size: 128
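
A hedged sketch of how this recipe could be reproduced with the TRL library is shown below. The base model, dataset, learning rate, optimizer, epoch count, and total batch size come from this card; the per-device batch size, gradient accumulation, beta, and precision are assumptions (only the total batch size of 128 on 4x H200 is reported), and `processing_class` is the argument name in recent TRL versions (older versions use `tokenizer=`).

```python
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

base_model = "W-61/llama-3-8b-base-sft-ultrachat-8xh200"

# UltraFeedback preference pairs (prompt / chosen / rejected).
dataset = load_dataset("HuggingFaceH4/ultrafeedback_binarized", split="train_prefs")

model = AutoModelForCausalLM.from_pretrained(base_model, torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(base_model)

config = DPOConfig(
    output_dir="llama-3-8b-dpo-ultrafeedback",
    learning_rate=5e-7,              # from the card
    num_train_epochs=1,              # from the card
    optim="adamw_torch",             # ADAMW_TORCH from the card
    per_device_train_batch_size=8,   # assumption: 8 x 4 GPUs x 4 accum = 128
    gradient_accumulation_steps=4,   # assumption; only the total (128) is reported
    max_length=8192,                 # matches the 8192-token context
    bf16=True,                       # assumption
)

# With ref_model unset, TRL uses a frozen copy of `model` as the DPO reference.
trainer = DPOTrainer(
    model=model,
    args=config,
    train_dataset=dataset,
    processing_class=tokenizer,
)
trainer.train()
```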

Performance Metrics

During training, the model achieved a final validation loss of 0.5985. Key DPO-specific metrics include a margin mean of 99.3391 and a KL mean of -301.0162; the margin reflects how strongly the policy separates chosen responses from rejected ones under the DPO objective.
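
For context on these numbers, the sketch below shows how DPO's implicit reward margin and loss are typically computed from summed token log-probabilities under the policy and a frozen reference model; the function name and the `beta` value are assumptions, as the card does not report them.

```python
import torch
import torch.nn.functional as F

def dpo_loss_and_margin(pi_chosen_logps: torch.Tensor,
                        pi_rejected_logps: torch.Tensor,
                        ref_chosen_logps: torch.Tensor,
                        ref_rejected_logps: torch.Tensor,
                        beta: float = 0.1):
    # Implicit DPO rewards: beta-scaled log-ratio of policy to reference.
    chosen_rewards = beta * (pi_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (pi_rejected_logps - ref_rejected_logps)

    # Margin: how far chosen responses are preferred over rejected ones;
    # its batch mean corresponds to a "margin mean" style metric.
    margins = chosen_rewards - rejected_rewards

    # DPO objective: maximize the log-sigmoid of the margin.
    loss = -F.logsigmoid(margins).mean()
    return loss, margins.mean()
```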

Intended Use Cases

The README does not detail specific intended uses or limitations. In general, models fine-tuned with DPO on preference datasets are well suited to tasks that require high-quality, human-aligned text generation, such as instruction following, dialogue, and content creation where user preference is a key factor.
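
As an illustration of dialogue-style use, the sketch below formats a conversation with the tokenizer's chat template before generating. This assumes the tokenizer ships a chat template (plausible given the UltraChat SFT stage, but not confirmed by this card); the message and sampling settings are illustrative.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "W-61/llama-3-8b-base-new-dpo-ultrafeedback-4xh200-batch-128-q_t-0.45-s_star-0.35-20260428-045924"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# Illustrative single-turn dialogue.
messages = [
    {"role": "user", "content": "Draft a polite follow-up email after a job interview."}
]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=256, do_sample=True, temperature=0.7)
# Decode only the newly generated tokens.
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```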