W-61/llama-3-8b-base-new-dpo-hh-harmless-s_star0.85-4xh200-batch-64-20260421-213851

Text Generation · Concurrency Cost: 1 · Model Size: 8B · Quant: FP8 · Ctx Length: 8k · Published: Apr 21, 2026 · Architecture: Transformer

W-61/llama-3-8b-base-new-dpo-hh-harmless-s_star0.85-4xh200-batch-64-20260421-213851 is an 8 billion parameter Llama 3 base model fine-tuned with Direct Preference Optimization (DPO) on the Anthropic/hh-rlhf dataset. The fine-tuning targets harmlessness, aligning the model with human preferences for helpful and harmless responses, which makes it suitable for applications that require robust safety and ethical AI behavior.


Model Overview

This model, llama-3-8b-base-new-dpo-hh-harmless-s_star0.85-4xh200-batch-64-20260421-213851, is an 8 billion parameter language model derived from the Llama 3 architecture. It has been fine-tuned using Direct Preference Optimization (DPO) on the Anthropic/hh-rlhf dataset, which is designed to align models with human preferences for helpfulness and harmlessness.

Key Characteristics

  • Base Model: Fine-tuned from W-61/llama-3-8b-base-sft-hh-harmless-4xh200.
  • Fine-tuning Method: Utilizes Direct Preference Optimization (DPO) for alignment.
  • Dataset: Trained on the Anthropic/hh-rlhf dataset, focusing on harmlessness.
  • Performance: Achieved a final validation loss of 0.5165 with a DPO margin mean of 14.9295, indicating effective preference learning (see the margin computation sketch after this list).
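
For reference, the DPO margin reported above is typically computed as the mean gap between the implicit rewards of the chosen and rejected responses. Below is a minimal PyTorch sketch of the standard DPO loss and margin; the beta value is illustrative and not taken from this model card:

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Compute the DPO loss and reward margin for a batch.

    Each argument is a tensor of per-sequence log-probabilities
    (summed over tokens) under the policy or the frozen reference model.
    beta is the usual DPO temperature; 0.1 is an assumed placeholder.
    """
    # Implicit rewards: beta-scaled log-ratio of policy to reference.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)

    # DPO objective: maximize the log-sigmoid of the reward gap.
    loss = -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

    # The "margin" as typically logged: mean reward gap between
    # chosen and rejected completions.
    margin = (chosen_rewards - rejected_rewards).mean()
    return loss, margin
```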

Training Details

The model was trained for a single epoch with a learning rate of 5e-7, a total batch size of 64, and the AdamW optimizer, distributed across 4 GPUs with 2 gradient accumulation steps (implying a per-device batch size of 8: 4 GPUs × 8 × 2 accumulation steps = 64).
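
A run with this shape could be reproduced with TRL's DPOTrainer along the following lines. This is a sketch under stated assumptions, not the exact training script: the per-device batch size of 8 is inferred from the totals above, Anthropic/hh-rlhf may need reshaping into TRL's prompt/chosen/rejected format, and keyword names vary slightly across TRL versions:

```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

# SFT checkpoint named as the base model in this card.
base = "W-61/llama-3-8b-base-sft-hh-harmless-4xh200"

model = AutoModelForCausalLM.from_pretrained(base)
tokenizer = AutoTokenizer.from_pretrained(base)
dataset = load_dataset("Anthropic/hh-rlhf", split="train")

config = DPOConfig(
    output_dir="llama-3-8b-dpo-hh-harmless",
    num_train_epochs=1,             # single epoch, per the card
    learning_rate=5e-7,             # per the card
    per_device_train_batch_size=8,  # inferred: 4 GPUs x 8 x 2 accum = 64 total
    gradient_accumulation_steps=2,  # per the card
    optim="adamw_torch",            # AdamW, per the card
)

trainer = DPOTrainer(
    model=model,
    args=config,
    train_dataset=dataset,
    processing_class=tokenizer,
)
trainer.train()
```

With no explicit ref_model argument, DPOTrainer creates a frozen copy of the policy as the reference model, which matches the standard DPO setup.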

Intended Use Cases

This model is particularly well-suited for applications where generating safe, ethical, and non-toxic responses is paramount. Its DPO fine-tuning on a harmlessness dataset makes it a strong candidate for content moderation, safe AI assistants, and other scenarios requiring robust ethical alignment.
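
A minimal inference example with Hugging Face transformers, assuming the checkpoint is published under the repository id shown above; the Human/Assistant prompt format follows the hh-rlhf convention the model was tuned on, and the sampling settings are illustrative:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "W-61/llama-3-8b-base-new-dpo-hh-harmless-s_star0.85-4xh200-batch-64-20260421-213851"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# hh-rlhf-style dialogue prompt.
prompt = "\n\nHuman: How do I politely decline a meeting invite?\n\nAssistant:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

outputs = model.generate(
    **inputs, max_new_tokens=200, do_sample=True, temperature=0.7
)
# Decode only the newly generated tokens.
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:],
                       skip_special_tokens=True))
```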