jackf857/llama-3-8b-base-new-dpo-hh-helpful-s_star0.85-4xh200-batch-64-20260421-233802

Text generation · Concurrency cost: 1 · Model size: 8B · Quantization: FP8 · Context length: 8k · Published: Apr 22, 2026 · Architecture: Transformer

jackf857/llama-3-8b-base-new-dpo-hh-helpful-s_star0.85-4xh200-batch-64-20260421-233802 is an 8-billion-parameter language model fine-tuned by jackf857 from the W-61/llama-3-8b-base-sft-hh-helpful-4xh200 base model. It was optimized with Direct Preference Optimization (DPO) on the Anthropic/hh-rlhf dataset to improve helpfulness and alignment with human preferences, and is designed for general-purpose conversational AI tasks where helpful, aligned responses are critical.


Model Overview

This model, developed by jackf857, is an 8-billion-parameter language model based on the Llama 3 architecture. It is a fine-tuned variant of W-61/llama-3-8b-base-sft-hh-helpful-4xh200, optimized with Direct Preference Optimization (DPO).

Key Characteristics

  • Base Model: Llama 3 8B.
  • Fine-tuning: Utilizes Direct Preference Optimization (DPO) for alignment.
  • Dataset: Trained on the Anthropic/hh-rlhf dataset, which focuses on human feedback for helpfulness and harmlessness.
  • Context Length: Supports a context window of 8192 tokens.
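
As a quick orientation, the snippet below sketches loading the checkpoint for local text generation with Hugging Face transformers. It assumes the weights are available on the Hugging Face Hub under the repository id above and that bf16 inference is acceptable (the hosted endpoint advertises FP8); the prompt is purely illustrative.

```python
# Minimal sketch: load the model for text generation with transformers,
# assuming the weights are published under the same repository id.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "jackf857/llama-3-8b-base-new-dpo-hh-helpful-s_star0.85-4xh200-batch-64-20260421-233802"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # assumption: bf16 weights; the hosted endpoint serves FP8
    device_map="auto",
)

prompt = "Explain what Direct Preference Optimization does in one paragraph."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Keep prompt + completion within the 8192-token context window.
outputs = model.generate(**inputs, max_new_tokens=256, do_sample=True, temperature=0.7)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```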

Training Details

The model was trained for a single epoch with a learning rate of 5e-07 and a total batch size of 64 across 4 H200 GPUs. Evaluation shows a final loss of 0.5131, with DPO-specific metrics indicating preference alignment, including a mean DPO reward margin of 139.8486 (the average gap between the implicit rewards of chosen and rejected responses). This fine-tuning aims to make the model's responses more helpful and better aligned with human instructions.
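
As a point of reference, the DPO objective and the margin metric can be sketched in a few lines of PyTorch. This is a minimal illustration of standard DPO, not the training code for this run; the variable names and the beta value are assumptions.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Sketch of the standard DPO objective. beta=0.1 is an assumed
    default, not a value reported for this run. Inputs are summed
    log-probs of the chosen/rejected completions under the policy and
    the frozen SFT reference model."""
    # Implicit rewards: beta-scaled log-probability ratios vs. the reference.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)

    # The reported margin metric is the mean of this difference.
    margins = chosen_rewards - rejected_rewards

    # DPO loss: negative log-sigmoid of the reward margin.
    loss = -F.logsigmoid(margins).mean()
    return loss, margins.mean()
```

A growing margin during training indicates the policy is assigning increasingly higher relative likelihood to chosen responses than rejected ones, which is what the mean margin of 139.8486 reflects.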

Intended Use Cases

This model is suitable for applications requiring a helpful and preference-aligned language model, particularly in conversational AI, instruction following, and general text generation where human-like helpfulness is desired.
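
Because the preference data comes from Anthropic/hh-rlhf, one reasonable (but undocumented) assumption is that prompts work best in that dataset's dialogue format, with alternating "\n\nHuman:" and "\n\nAssistant:" turns. The helper below is a hypothetical sketch of that formatting, not an official chat template for this checkpoint.

```python
def format_hh_prompt(turns):
    """Format a conversation in the Anthropic hh-rlhf dialogue style.
    Assumption: this matches the data seen during SFT and DPO; the model
    card documents no official chat template."""
    text = ""
    for role, content in turns:
        text += f"\n\n{role}: {content}"
    # End with an open Assistant turn for the model to complete.
    return text + "\n\nAssistant:"

prompt = format_hh_prompt([
    ("Human", "How do I safely reheat rice?"),
    ("Assistant", "Cool it quickly, refrigerate within an hour, then reheat until steaming hot."),
    ("Human", "How long can it keep in the fridge?"),
])
# Pass `prompt` to model.generate() as in the loading sketch above.
```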