W-61/llama-3-8b-base-new-dpo-hh-harmless-s_star0.6-4xh200-batch-64-20260421-213851
W-61/llama-3-8b-base-new-dpo-hh-harmless-s_star0.6-4xh200-batch-64-20260421-213851 is an 8-billion-parameter language model fine-tuned by W-61 on the Llama 3 architecture. It was optimized with Direct Preference Optimization (DPO) on the Anthropic/hh-rlhf dataset to improve harmlessness and alignment with human preferences. With an 8192-token context window, it is designed for applications that require robust, safety-aligned text generation.
Model Overview
This model, developed by W-61, is an 8-billion-parameter language model built on the Llama 3 architecture. It is a fine-tuned iteration of the W-61/llama-3-8b-base-sft-hh-harmless-4xh200 SFT checkpoint.
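The model should be loadable with the Hugging Face transformers library. The sketch below is a minimal, untested example assuming the weights are hosted on the Hub under the ID from this card; the dtype, device placement, and sampling parameters are illustrative defaults, not settings taken from this card.

```python
# Minimal inference sketch. The model ID comes from this card; dtype, device
# placement, and generation settings are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "W-61/llama-3-8b-base-new-dpo-hh-harmless-s_star0.6-4xh200-batch-64-20260421-213851"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # assumption: bf16 weights fit on your GPU
    device_map="auto",
)

prompt = "Explain why strong passwords matter."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256, do_sample=True, temperature=0.7)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```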
Key Capabilities
- Harmlessness Optimization: The model has undergone Direct Preference Optimization (DPO) using the Anthropic/hh-rlhf dataset, specifically targeting improved harmlessness and alignment with human feedback.
- Performance Metrics: During training, it reached a final validation loss of 0.5427, with DPO-specific metrics indicating preference alignment, such as a mean DPO margin of 46.9651 (see the sketch after this list for how this margin is typically computed).
- Context Length: It supports an 8192 token context window, suitable for processing moderately long inputs and generating coherent responses.
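For context, the margin reported above is the quantity the DPO objective pushes apart: the difference in implicit rewards (scaled policy-to-reference log-ratios) between chosen and rejected responses. The sketch below shows one standard way this loss and margin are computed; the function, variable names, and beta value are illustrative, not this model's actual training code.

```python
# Illustrative DPO loss, showing the margin that training logs report.
# Inputs are summed log-probabilities of the chosen/rejected responses under
# the fine-tuned policy and the frozen reference model. beta is an assumption.
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    # Implicit rewards: scaled log-ratio of policy to reference
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # The logged "margin" is the mean of this difference across the batch
    margins = chosen_rewards - rejected_rewards
    # DPO objective: negative log-sigmoid of the margin
    loss = -F.logsigmoid(margins).mean()
    return loss, margins.mean()
```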
Training Details
The model was trained for one epoch with a learning rate of 5e-07 and a total batch size of 64 across 4 GPUs, using the AdamW optimizer and a cosine learning-rate scheduler with a 0.1 warmup ratio.
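For readers who want to approximate a comparable setup, the sketch below maps these hyperparameters onto TRL's DPOConfig. This is not the training code used for this model; the per-device split of the 64-example batch and the beta value are assumptions, and only the headline values come from this card.

```python
# Sketch of a TRL DPOConfig approximating the hyperparameters above.
# Not this model's actual training configuration.
from trl import DPOConfig

config = DPOConfig(
    num_train_epochs=1,
    learning_rate=5e-7,
    per_device_train_batch_size=16,  # assumption: 16 x 4 GPUs = 64 total
    optim="adamw_torch",
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    beta=0.1,  # assumption: common DPO default, not stated on this card
)
```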
Intended Use Cases
This model is particularly well-suited for applications where generating safe, harmless, and aligned text is a primary concern. Its DPO fine-tuning on a human preference dataset makes it a strong candidate for conversational AI, content moderation, and other tasks requiring careful output generation.
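Because the preference data comes from Anthropic/hh-rlhf, one reasonable but unverified assumption is that prompts formatted in that dataset's Human:/Assistant: turn style will elicit the most aligned responses. A hypothetical helper, usable with the model and tokenizer loaded earlier:

```python
# Hypothetical helper formatting a conversation in the hh-rlhf turn style.
# The "\n\nHuman:" / "\n\nAssistant:" format is an assumption based on the
# preference dataset, not documented behavior of this model.
def format_hh_prompt(turns):
    """turns: list of (speaker, text) pairs, speaker in {"Human", "Assistant"}."""
    prompt = ""
    for speaker, text in turns:
        prompt += f"\n\n{speaker}: {text}"
    return prompt + "\n\nAssistant:"

prompt = format_hh_prompt([
    ("Human", "Can you help me write a polite complaint email?"),
])
```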