jackf857/llama-3-8b-base-new-dpo-hh-helpful-4xh200-batch-64-q_t-0.5-s_star-0.6

TEXT GENERATION · Concurrency Cost: 1 · Model Size: 8B · Quant: FP8 · Ctx Length: 8k · Published: Apr 25, 2026 · Architecture: Transformer

jackf857/llama-3-8b-base-new-dpo-hh-helpful-4xh200-batch-64-q_t-0.5-s_star-0.6 is an 8-billion-parameter Llama 3 base model fine-tuned by jackf857 using Direct Preference Optimization (DPO). Trained on the Anthropic/hh-rlhf dataset, it is specifically optimized for helpfulness and is intended for applications that require helpful, aligned text generation.


Model Overview

This model, llama-3-8b-base-new-dpo-hh-helpful-4xh200-batch-64-q_t-0.5-s_star-0.6, builds on the supervised fine-tuned checkpoint W-61/llama-3-8b-base-sft-hh-helpful-4xh200. jackf857 further trained it with Direct Preference Optimization (DPO) on Anthropic/hh-rlhf, a dataset of human preference pairs commonly used to align models toward helpfulness.

Key Training Details

  • Base Model: Fine-tuned from W-61/llama-3-8b-base-sft-hh-helpful-4xh200.
  • Optimization Method: Direct Preference Optimization (DPO).
  • Dataset: Anthropic/hh-rlhf, a human-preference dataset oriented toward helpful and harmless responses.
  • Training Hyperparameters (mirrored in the illustrative sketch after this list):
    • Learning Rate: 5e-07
    • Optimizer: ADAMW_TORCH
    • Epochs: 1
    • Total Train Batch Size: 64 (across 4 GPUs)
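
For orientation, here is a minimal sketch of how this configuration maps onto TRL's DPOTrainer. This is not the author's script: the output directory, the prompt-extraction step, and the per-device batch split (16 per GPU × 4 GPUs = 64 total) are assumptions; only the base checkpoint, learning rate, optimizer, epoch count, and total batch size come from the card.

```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

# Assumed starting point: the SFT checkpoint named in the list above.
base = "W-61/llama-3-8b-base-sft-hh-helpful-4xh200"
model = AutoModelForCausalLM.from_pretrained(base)
tokenizer = AutoTokenizer.from_pretrained(base)

# hh-rlhf stores full transcripts; DPOTrainer expects prompt/chosen/rejected
# columns. Splitting on the last "Assistant:" turn is a common convention,
# not the author's confirmed preprocessing.
def to_preference_pairs(example):
    prompt, _, chosen = example["chosen"].rpartition("\n\nAssistant:")
    _, _, rejected = example["rejected"].rpartition("\n\nAssistant:")
    return {"prompt": prompt + "\n\nAssistant:", "chosen": chosen, "rejected": rejected}

train_dataset = load_dataset("Anthropic/hh-rlhf", split="train").map(to_preference_pairs)

config = DPOConfig(
    output_dir="llama-3-8b-base-new-dpo-hh-helpful",  # hypothetical name
    learning_rate=5e-7,              # from the card
    optim="adamw_torch",             # ADAMW_TORCH
    num_train_epochs=1,              # from the card
    per_device_train_batch_size=16,  # assumed split: 16 per GPU x 4 H200s = 64 total
)

# Recent TRL versions take processing_class; older ones use tokenizer= instead.
trainer = DPOTrainer(model=model, args=config, train_dataset=train_dataset,
                     processing_class=tokenizer)
trainer.train()
```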

Performance Metrics

During training, the model reached a final validation loss of 0.5691. Key DPO-specific metrics include a mean reward margin of 152.7355, the average gap between the implicit rewards assigned to chosen and rejected responses, and a logged beta of 0.0027.
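
To make the margin metric concrete, here is a small, self-contained sketch of the standard DPO arithmetic; the function name and the default beta of 0.1 are illustrative, not values from this run.

```python
import torch
import torch.nn.functional as F

def dpo_margins_and_loss(policy_chosen_logps, policy_rejected_logps,
                         ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Illustrative DPO math. Each argument is a tensor of summed token
    log-probabilities for a batch of responses under the policy being
    trained or the frozen reference model."""
    # Implicit rewards: how much the policy upweights a response vs. the reference.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # "Margin" = chosen reward minus rejected reward; its batch mean is what
    # training logs report (152.7355 at the end of this run).
    margins = chosen_rewards - rejected_rewards
    # Standard DPO loss: push margins positive via a logistic objective.
    loss = -F.logsigmoid(margins).mean()
    return margins.mean(), loss
```

A large positive margin mean, like the 152.7355 logged here, indicates that the policy separates preferred from rejected responses far more sharply than the reference model does.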

Intended Use Cases

Given its DPO fine-tuning on a helpfulness-oriented preference dataset, this model is intended for applications where helpful, preference-aligned text generation matters most, such as assistant-style question answering and instruction following, and for any task where responses should track human preferences for assistance and utility.
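
A minimal generation sketch with transformers, assuming the checkpoint is published on the Hugging Face Hub under the repo id in the title; the hh-rlhf-style prompt format and the sampling settings are illustrative defaults, not values from the card.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "jackf857/llama-3-8b-base-new-dpo-hh-helpful-4xh200-batch-64-q_t-0.5-s_star-0.6"
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(repo_id, device_map="auto")

# hh-rlhf-style prompt: Human/Assistant turns separated by blank lines.
prompt = "\n\nHuman: How do I write a polite follow-up email?\n\nAssistant:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256, do_sample=True,
                         temperature=0.7, top_p=0.9)

# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:],
                       skip_special_tokens=True))
```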