W-61/llama-3-8b-base-new-dpo-hh-harmless-s_star0.85-4xh200-batch-64-20260421-213851

Text Generation · Concurrency Cost: 1 · Model Size: 8B · Quant: FP8 · Ctx Length: 8k · Published: Apr 21, 2026 · Architecture: Transformer

W-61/llama-3-8b-base-new-dpo-hh-harmless-s_star0.85-4xh200-batch-64-20260421-213851 is an 8 billion parameter Llama 3 base model fine-tuned with Direct Preference Optimization (DPO) on the Anthropic/hh-rlhf dataset. The fine-tuning targets harmlessness, aligning the model with human preferences for helpful and harmless responses, which makes it suitable for applications that require robust safety and ethical AI behavior.


Model Overview

This model, llama-3-8b-base-new-dpo-hh-harmless-s_star0.85-4xh200-batch-64-20260421-213851, is an 8 billion parameter language model derived from the Llama 3 architecture. It has been fine-tuned using Direct Preference Optimization (DPO) on the Anthropic/hh-rlhf dataset, which is designed to align models with human preferences for helpfulness and harmlessness.

Key Characteristics

  • Base Model: Fine-tuned from W-61/llama-3-8b-base-sft-hh-harmless-4xh200.
  • Fine-tuning Method: Utilizes Direct Preference Optimization (DPO) for alignment.
  • Dataset: Trained on the Anthropic/hh-rlhf dataset, focusing on harmlessness.
  • Performance: Achieved a final validation loss of 0.5165 with a DPO margin mean of 14.9295, indicating effective preference learning (see the margin computation sketch after this list).
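
For reference, the DPO margin reported above is typically computed as the mean gap between the implicit rewards of the chosen and rejected responses. Below is a minimal PyTorch sketch of the standard DPO loss and margin; the beta value is illustrative and not taken from this model card:

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Compute the DPO loss and reward margin for a batch.

    Each argument is a tensor of per-sequence log-probabilities
    (summed over tokens) under the policy or the frozen reference model.
    beta is the usual DPO temperature; 0.1 is an assumed placeholder.
    """
    # Implicit rewards: beta-scaled log-ratio of policy to reference.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)

    # DPO objective: maximize the log-sigmoid of the reward gap.
    loss = -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

    # The "margin" as typically logged: mean reward gap between
    # chosen and rejected completions.
    margin = (chosen_rewards - rejected_rewards).mean()
    return loss, margin
```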

Training Details

The model was trained for a single epoch with a learning rate of 5e-7, a total batch size of 64, and the AdamW optimizer, distributed across 4 GPUs with 2 gradient accumulation steps (implying a per-device batch size of 8: 4 GPUs × 8 × 2 accumulation steps = 64).
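
A run with this shape could be reproduced with TRL's DPOTrainer along the following lines. This is a sketch under stated assumptions, not the exact training script: the per-device batch size of 8 is inferred from the totals above, Anthropic/hh-rlhf may need reshaping into TRL's prompt/chosen/rejected format, and keyword names vary slightly across TRL versions:

```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

# SFT checkpoint named as the base model in this card.
base = "W-61/llama-3-8b-base-sft-hh-harmless-4xh200"

model = AutoModelForCausalLM.from_pretrained(base)
tokenizer = AutoTokenizer.from_pretrained(base)
dataset = load_dataset("Anthropic/hh-rlhf", split="train")

config = DPOConfig(
    output_dir="llama-3-8b-dpo-hh-harmless",
    num_train_epochs=1,             # single epoch, per the card
    learning_rate=5e-7,             # per the card
    per_device_train_batch_size=8,  # inferred: 4 GPUs x 8 x 2 accum = 64 total
    gradient_accumulation_steps=2,  # per the card
    optim="adamw_torch",            # AdamW, per the card
)

trainer = DPOTrainer(
    model=model,
    args=config,
    train_dataset=dataset,
    processing_class=tokenizer,
)
trainer.train()
```

With no explicit ref_model argument, DPOTrainer creates a frozen copy of the policy as the reference model, which matches the standard DPO setup.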

Intended Use Cases

This model is particularly well-suited for applications where generating safe, ethical, and non-toxic responses is paramount. Its DPO fine-tuning on a harmlessness dataset makes it a strong candidate for content moderation, safe AI assistants, and other scenarios requiring robust ethical alignment.
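
A minimal inference example with Hugging Face transformers, assuming the checkpoint is published under the repository id shown above; the Human/Assistant prompt format follows the hh-rlhf convention the model was tuned on, and the sampling settings are illustrative:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "W-61/llama-3-8b-base-new-dpo-hh-harmless-s_star0.85-4xh200-batch-64-20260421-213851"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# hh-rlhf-style dialogue prompt.
prompt = "\n\nHuman: How do I politely decline a meeting invite?\n\nAssistant:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

outputs = model.generate(
    **inputs, max_new_tokens=200, do_sample=True, temperature=0.7
)
# Decode only the newly generated tokens.
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:],
                       skip_special_tokens=True))
```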