jackf857/llama-3-8b-base-new-dpo-harmless-4xh200-s_star1.0
Text Generation · Concurrency Cost: 1 · Model Size: 8B · Quant: FP8 · Ctx Length: 8k · Published: Apr 21, 2026 · Architecture: Transformer · Cold
The jackf857/llama-3-8b-base-new-dpo-harmless-4xh200-s_star1.0 model is an 8 billion parameter language model, fine-tuned from W-61/llama-3-8b-base-sft-hh-harmless-4xh200. It was further optimized using Direct Preference Optimization (DPO) on the Anthropic/hh-rlhf dataset, focusing on harmlessness and alignment. This model is designed for applications requiring a robust and safety-aligned Llama 3 base, particularly in scenarios where mitigating harmful outputs is critical.
Model Overview
This model, jackf857/llama-3-8b-base-new-dpo-harmless-4xh200-s_star1.0, is an 8 billion parameter language model derived from the Llama 3 architecture. It represents a fine-tuned iteration of the W-61/llama-3-8b-base-sft-hh-harmless-4xh200 base model.
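The checkpoint can be loaded with the standard Hugging Face transformers API. Below is a minimal inference sketch; the bf16 dtype and the `Human:`/`Assistant:` prompt format are assumptions based on the hh-rlhf training data, not documented properties of this checkpoint (the hosted endpoint serves FP8).

```python
# Minimal inference sketch using Hugging Face transformers.
# Assumes the repo id below is reachable on the Hub; adjust dtype/device for your hardware.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "jackf857/llama-3-8b-base-new-dpo-harmless-4xh200-s_star1.0"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # assumption: bf16 local weights; the hosted endpoint uses FP8
    device_map="auto",
)

# Assumption: hh-rlhf-style dialogue format, since both the SFT and DPO stages used that dataset.
prompt = "Human: How do I stay safe online?\n\nAssistant:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256, do_sample=True, temperature=0.7)

# Print only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```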
Key Capabilities
- Harmlessness Alignment: The model has undergone further Direct Preference Optimization (DPO) using the Anthropic/hh-rlhf dataset, specifically targeting the reduction of harmful outputs and improved alignment with human preferences.
- Performance Metrics: During training, it achieved a final validation loss of 0.5214, with DPO metrics showing a margin mean of 11.8756 and a chosen log-probability of -96.2474, suggesting effective preference learning (the objective behind these metrics is shown after this list).
- Training Configuration: Trained for 1 epoch with a learning rate of 5e-07 and a total batch size of 64 across 4 GPUs (a hedged reconstruction of this setup follows below).
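The reported metrics come from the standard DPO objective (Rafailov et al., 2023), under which the model's implicit reward for a response is the β-scaled log-probability ratio against a frozen reference policy:

$$
\mathcal{L}_\text{DPO} = -\mathbb{E}_{(x,\, y_w,\, y_l)}\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_\text{ref}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_\text{ref}(y_l \mid x)}\right)\right]
$$

The margin mean is the average gap between the implicit rewards of chosen and rejected responses; a large positive margin indicates the model has learned to consistently prefer the chosen completions.

The sketch below is a hedged reconstruction of this training stage using the trl library, not the authors' actual script. Only the base checkpoint, dataset, epoch count, learning rate, and total batch size (64 across 4 GPUs) come from the card; the β value, batch-size decomposition, and prompt-splitting preprocessing are assumptions.

```python
# Hedged reconstruction of the DPO stage described above, using trl.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

base_id = "W-61/llama-3-8b-base-sft-hh-harmless-4xh200"
tokenizer = AutoTokenizer.from_pretrained(base_id)
model = AutoModelForCausalLM.from_pretrained(base_id)

def split_prompt(example):
    # hh-rlhf stores full dialogues; split off the final assistant turn to
    # recover the (prompt, chosen, rejected) fields DPOTrainer expects.
    marker = "\n\nAssistant:"
    c, r = example["chosen"], example["rejected"]
    ci, ri = c.rfind(marker), r.rfind(marker)
    return {
        "prompt": c[: ci + len(marker)],
        "chosen": c[ci + len(marker):],
        "rejected": r[ri + len(marker):],
    }

train_dataset = load_dataset("Anthropic/hh-rlhf", split="train").map(split_prompt)

config = DPOConfig(
    output_dir="llama-3-8b-dpo-harmless",
    num_train_epochs=1,             # from the model card
    learning_rate=5e-7,             # from the model card
    per_device_train_batch_size=4,  # assumption: 4 x 4 accumulation x 4 GPUs = 64 total
    gradient_accumulation_steps=4,
    beta=0.1,                       # assumption: trl's default KL weight; not documented
    bf16=True,
)

trainer = DPOTrainer(
    model=model,                    # ref_model omitted: trl builds the frozen reference copy
    args=config,
    train_dataset=train_dataset,
    processing_class=tokenizer,     # `tokenizer=` on older trl versions
)
trainer.train()
```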
Good For
- Safety-Critical Applications: Ideal for use cases where generating harmless and aligned text is a primary concern, such as content moderation, safe AI assistants, or educational tools.
- Further Fine-tuning: Serves as a strong, safety-aligned base model for subsequent fine-tuning on domain-specific tasks where a foundation of harmlessness is desired (see the sketch after this list).
- Research in Alignment: Useful for researchers exploring DPO techniques and the impact of the Anthropic/hh-rlhf dataset on model behavior and safety.
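For the further fine-tuning use case, one lightweight option is LoRA adaptation on top of this checkpoint. The sketch below is illustrative only: the dataset name is a placeholder and the LoRA hyperparameters are typical defaults, not recommendations from the model card.

```python
# Minimal sketch: LoRA fine-tuning on top of the safety-aligned checkpoint.
from datasets import load_dataset
from peft import LoraConfig
from trl import SFTConfig, SFTTrainer

model_id = "jackf857/llama-3-8b-base-new-dpo-harmless-4xh200-s_star1.0"

# Hypothetical domain dataset with a "text" column; replace with your own.
train_dataset = load_dataset("your-org/your-domain-dataset", split="train")

peft_config = LoraConfig(
    r=16,                # assumption: a typical LoRA rank
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # Llama attention projections
    task_type="CAUSAL_LM",
)

trainer = SFTTrainer(
    model=model_id,      # trl loads the checkpoint and tokenizer from the Hub
    args=SFTConfig(output_dir="llama-3-8b-harmless-domain-sft"),
    train_dataset=train_dataset,
    peft_config=peft_config,
)
trainer.train()
```

Training only low-rank adapters keeps the DPO-aligned base weights frozen, which helps preserve the harmlessness behavior the checkpoint was optimized for while adapting to the new domain.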