jackf857/llama-3-8b-base-new-dpo-hh-helpful-s_star0.6-4xh200-batch-64-20260421-214335-rerun

Text Generation · Concurrency Cost: 1 · Model Size: 8B · Quant: FP8 · Ctx Length: 8k · Published: Apr 21, 2026 · Architecture: Transformer

jackf857/llama-3-8b-base-new-dpo-hh-helpful-s_star0.6-4xh200-batch-64-20260421-214335-rerun is an 8-billion-parameter Llama 3 base model fine-tuned with Direct Preference Optimization (DPO) on the Anthropic/hh-rlhf dataset, which targets helpfulness. The fine-tuning is intended to produce more helpful, better-aligned responses than the checkpoint it was trained from.


Model Overview

This model, jackf857/llama-3-8b-base-new-dpo-hh-helpful-s_star0.6-4xh200-batch-64-20260421-214335-rerun, is an 8-billion-parameter Llama 3 variant, fine-tuned from the SFT checkpoint W-61/llama-3-8b-base-sft-hh-helpful-4xh200 using Direct Preference Optimization (DPO).

Key Characteristics

  • Base Model: Llama 3 8B.
  • Fine-tuning Method: Direct Preference Optimization (DPO).
  • Training Data: Primarily fine-tuned on the Anthropic/hh-rlhf dataset, which focuses on human feedback for helpfulness.
  • Optimization Goal: Enhanced helpfulness and alignment in generated responses.
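
For a quick start, the sketch below shows one way to load the model for text generation. It is a minimal illustration, assuming the standard Hugging Face transformers API; the hh-rlhf-style "Human:/Assistant:" prompt format is an assumption based on the training data, and the prompt text itself is hypothetical.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Model ID taken from this card.
model_id = "jackf857/llama-3-8b-base-new-dpo-hh-helpful-s_star0.6-4xh200-batch-64-20260421-214335-rerun"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # the hosted endpoint advertises FP8; bf16 is a safe local default
    device_map="auto",
)

# hh-rlhf conversations are formatted as "\n\nHuman: ...\n\nAssistant:" turns,
# so prompting in that format is a reasonable assumption for this fine-tune.
prompt = "\n\nHuman: How do I write a polite follow-up email?\n\nAssistant:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=256,
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
)
# Decode only the newly generated tokens.
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```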

Training Details

The model was trained for 1 epoch with a learning rate of 5e-07, a per-device train_batch_size of 8, and gradient_accumulation_steps of 2; across the 4 H200 GPUs this yields a total_train_batch_size of 64. The optimizer was adamw_torch with a cosine learning-rate schedule. Final evaluation reports a loss of 0.5572 and a mean DPO margin of 156.2329, consistent with the policy learning to separate preferred from dispreferred responses.
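
For readers who want to reproduce a comparable run, the sketch below shows how these hyperparameters could be expressed with TRL's DPOTrainer. This is a reconstruction from the numbers on this card, not the author's actual training script: the DPO beta is not stated here (the "s_star0.6" token in the model name is left uninterpreted), so TRL's default is kept, and the helpful-base subset of hh-rlhf is an assumption based on the "hh-helpful" naming.

```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

# SFT checkpoint named on this card as the DPO starting point.
sft_model_id = "W-61/llama-3-8b-base-sft-hh-helpful-4xh200"

model = AutoModelForCausalLM.from_pretrained(sft_model_id)
tokenizer = AutoTokenizer.from_pretrained(sft_model_id)

# hh-rlhf ships "chosen"/"rejected" pairs; recent TRL versions extract the
# shared prompt prefix from such implicit-prompt datasets automatically.
train_dataset = load_dataset("Anthropic/hh-rlhf", data_dir="helpful-base", split="train")

config = DPOConfig(
    output_dir="llama-3-8b-dpo-hh-helpful",
    num_train_epochs=1,
    learning_rate=5e-7,
    per_device_train_batch_size=8,   # x2 grad accumulation x4 H200s = 64 effective
    gradient_accumulation_steps=2,
    lr_scheduler_type="cosine",
    optim="adamw_torch",
    bf16=True,
)

trainer = DPOTrainer(
    model=model,          # ref_model omitted; TRL clones a frozen reference copy
    args=config,
    train_dataset=train_dataset,
    processing_class=tokenizer,
)
trainer.train()
```

In TRL's logging, the DPO margin is the gap between the implicit rewards of the chosen and rejected responses, i.e. beta times the policy-to-reference log-probability ratio evaluated on each, so a large positive mean margin indicates the policy cleanly separates preferred from dispreferred completions.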

Intended Use Cases

This model is suited to applications that need helpful, aligned text generation, particularly where helpfulness as judged by human preference is the primary objective. Its DPO fine-tuning on the hh-rlhf dataset suggests strong performance in conversational agents, content generation, and question-answering systems where helpfulness is a key metric.