jackf857/llama-3-8b-base-new-dpo-hh-harmless-4xh200-batch-64-q_t-0.5-s_star-0.85
The jackf857/llama-3-8b-base-new-dpo-hh-harmless-4xh200-batch-64-q_t-0.5-s_star-0.85 model is an 8 billion parameter Llama 3-based language model, fine-tuned by jackf857. It is specifically optimized using Direct Preference Optimization (DPO) on the Anthropic/hh-rlhf dataset to enhance harmlessness and align with human preferences. This model is primarily intended for applications requiring a robust, preference-aligned LLM with a focus on generating safe and helpful responses, making it suitable for conversational AI and content moderation tasks.
Model Overview
This model, llama-3-8b-base-new-dpo-hh-harmless-4xh200-batch-64-q_t-0.5-s_star-0.85, is an 8 billion parameter Llama 3-based language model. It has been fine-tuned by jackf857 using Direct Preference Optimization (DPO) on the Anthropic/hh-rlhf dataset, building upon the W-61/llama-3-8b-base-sft-hh-harmless-4xh200 base model. The primary goal of this fine-tuning was to improve the model's harmlessness and alignment with human preferences, as indicated by its training on a dataset focused on helpful and harmless responses.
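A minimal usage sketch with the Transformers library is below. The prompt layout is an assumption: the hh-rlhf dataset uses a "Human:/Assistant:" dialogue format, so the helper below mirrors that, but the exact template the fine-tune expects may differ. Generation parameters are illustrative, not tuned values from the author.

```python
MODEL_ID = "jackf857/llama-3-8b-base-new-dpo-hh-harmless-4xh200-batch-64-q_t-0.5-s_star-0.85"

def format_prompt(user_message: str) -> str:
    # Assumption: hh-rlhf-style dialogue layout, matching the fine-tuning data.
    return f"\n\nHuman: {user_message}\n\nAssistant:"

def generate(user_message: str, max_new_tokens: int = 256) -> str:
    # Heavy dependencies are imported lazily so the helper above stays lightweight.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID, device_map="auto")
    inputs = tokenizer(format_prompt(user_message), return_tensors="pt").to(model.device)
    output = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
    # Decode only the newly generated tokens, not the echoed prompt.
    return tokenizer.decode(
        output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
    )
```

Note that loading the full 8B model requires roughly 16 GB of GPU memory in bfloat16; `device_map="auto"` lets Transformers shard or offload as needed.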
Key Capabilities
- Preference Alignment: Optimized using DPO to align with human preferences, particularly for harmlessness.
- Safety-Focused Generation: Designed to produce responses that are helpful and avoid harmful content.
- Llama 3 Architecture: Benefits from the foundational capabilities of the Llama 3 8B base model.
Good For
- Conversational AI: Developing chatbots or virtual assistants where safety and harmlessness are critical.
- Content Moderation: Assisting in filtering or generating content that adheres to safety guidelines.
- Research in Alignment: Exploring DPO techniques for improving LLM behavior and safety.
Training Details
The model was trained with a learning rate of 5e-07, a total batch size of 64, and for 1 epoch. Evaluation shows a final validation loss of 0.5392 and a mean DPO reward margin of 4.1973, indicating effective preference learning.
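To make the reported margin concrete, here is a sketch of the standard DPO loss for a single preference pair. This is not the author's training code; the `beta` value and the log-probability inputs are illustrative assumptions. The "margin" is the gap between the implicit chosen and rejected rewards, i.e. the quantity whose mean is reported above.

```python
import math

def dpo_loss(pi_chosen: float, pi_rejected: float,
             ref_chosen: float, ref_rejected: float,
             beta: float = 0.1) -> tuple[float, float]:
    """DPO loss for one preference pair.

    Inputs are summed log-probabilities of the chosen/rejected responses
    under the policy (pi_*) and the frozen reference model (ref_*).
    """
    # Implicit rewards: beta-scaled log-ratios against the reference model.
    chosen_reward = beta * (pi_chosen - ref_chosen)
    rejected_reward = beta * (pi_rejected - ref_rejected)
    margin = chosen_reward - rejected_reward
    # -log sigmoid(margin): small when the policy prefers the chosen response.
    loss = -math.log(1.0 / (1.0 + math.exp(-margin)))
    return loss, margin
```

A large positive mean margin (here, 4.1973) means the fine-tuned policy assigns substantially higher relative likelihood to preferred (harmless) responses than the reference model does.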