jackf857/llama-3-8b-base-new-dpo-hh-harmless-4xh200-batch-64-q_t-0.5-s_star-0.6
This model is an 8-billion-parameter Llama 3 base model, fine-tuned by jackf857 with Direct Preference Optimization (DPO) on the Anthropic/hh-rlhf dataset. It is optimized for harmlessness and alignment, building on a supervised fine-tuned (SFT) checkpoint. The model targets applications that require robust safety and reduced harmful outputs, making it suitable for general-purpose conversational AI where ethical considerations are paramount.
Model Overview
This model, llama-3-8b-base-new-dpo-hh-harmless-4xh200-batch-64-q_t-0.5-s_star-0.6, is an 8-billion-parameter Llama 3 base model. It was fine-tuned with Direct Preference Optimization (DPO) on the Anthropic/hh-rlhf dataset, which is known for its focus on harmlessness and helpfulness. The DPO fine-tuning builds on the W-61/llama-3-8b-base-sft-hh-harmless-4xh200 checkpoint, which was produced by an initial Supervised Fine-Tuning (SFT) phase.
Key Characteristics
- Base Model: Llama 3 (8 billion parameters).
- Fine-tuning Method: Direct Preference Optimization (DPO).
- Dataset: Anthropic/hh-rlhf, emphasizing harmlessness.
- Context Length: 8192 tokens.
- Training Details: Trained with a learning rate of 5e-07, a total batch size of 64, and a cosine learning rate scheduler over 1 epoch.
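The cosine scheduler mentioned above starts at the peak learning rate and decays it smoothly to near zero over the epoch. A minimal sketch of that schedule, using the reported base rate of 5e-07 (the step counts and the optional warmup are illustrative assumptions, not taken from this run):

```python
import math

def cosine_lr(step: int, total_steps: int, base_lr: float = 5e-7,
              warmup_steps: int = 0, min_lr: float = 0.0) -> float:
    """Cosine learning-rate schedule: optional linear warmup, then
    cosine decay from base_lr down to min_lr over the remaining steps."""
    if warmup_steps and step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

# Over a hypothetical 1000-step epoch: peak at the start,
# half the peak at the midpoint, ~0 at the end.
print(cosine_lr(0, 1000))     # 5e-07
print(cosine_lr(500, 1000))   # ~2.5e-07
print(cosine_lr(1000, 1000))  # ~0.0
```

With a total batch size of 64 on 4x H200 GPUs, the per-device batch times gradient-accumulation steps would multiply out to 16 per GPU, though the exact split is not stated here.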
Performance Metrics
During training, the model reached a final validation loss of 0.5318. Notable DPO-specific metrics include a mean reward margin (dpo/margin) of 34.3262 and mean log-probabilities of chosen responses (logps/chosen) of -139.1072, indicating that the model learned to prefer the chosen (harmless) responses over the rejected ones.
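The margin metric above follows from how DPO scores a preference pair: each response gets an implicit reward equal to the beta-scaled log-probability ratio between the policy and the reference (SFT) model, and the loss pushes the chosen reward above the rejected one. A minimal sketch of that computation (the log-probability values and beta below are illustrative assumptions, not numbers from this run):

```python
import math

def dpo_loss_and_margin(policy_chosen_logp, policy_rejected_logp,
                        ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for a single preference pair.
    Implicit reward = beta * (policy logp - reference logp);
    margin = reward_chosen - reward_rejected;
    loss = -log(sigmoid(margin))."""
    reward_chosen = beta * (policy_chosen_logp - ref_chosen_logp)
    reward_rejected = beta * (policy_rejected_logp - ref_rejected_logp)
    margin = reward_chosen - reward_rejected
    loss = -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log sigmoid(margin)
    return loss, margin

# Illustrative inputs: the policy likes the chosen response more than the
# reference does, and the rejected response less, so the margin is
# positive and the loss is small.
loss, margin = dpo_loss_and_margin(-139.1, -160.0, -150.0, -155.0)
print(round(margin, 2), round(loss, 4))
```

A large positive mean margin, like the 34.3262 reported here, means the policy's implicit rewards separate chosen from rejected responses by a wide gap on the validation set.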
Intended Use Cases
This model is particularly well-suited for applications where generating harmless and aligned text is critical. It can be used for:
- Safe conversational AI: Developing chatbots or virtual assistants that prioritize ethical and non-toxic responses.
- Content moderation assistance: Helping to filter or flag potentially harmful content.
- Research into AI safety and alignment: Providing a base for further experimentation in reducing undesirable model behaviors.
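Because the model was preference-tuned on Anthropic/hh-rlhf, prompts in that dataset's transcript style (alternating "Human:" and "Assistant:" turns) are a natural fit at inference time. A minimal formatting sketch, assuming the standard hh-rlhf layout; the helper name and role labels are illustrative, and exact whitespace conventions may differ from the training data:

```python
def format_hh_prompt(turns):
    """Render a conversation into the Anthropic/hh-rlhf transcript style:
    alternating "\n\nHuman:" / "\n\nAssistant:" turns, ending with an open
    "Assistant:" tag for the model to complete."""
    parts = []
    for role, text in turns:
        tag = "Human" if role == "user" else "Assistant"
        parts.append(f"\n\n{tag}: {text}")
    parts.append("\n\nAssistant:")
    return "".join(parts)

prompt = format_hh_prompt([("user", "How do I stay safe online?")])
print(repr(prompt))
# '\n\nHuman: How do I stay safe online?\n\nAssistant:'
```

The resulting string can be passed directly to a standard causal-LM generation call; the model's completion after the final "Assistant:" tag is its response.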