jackf857/llama-3-8b-base-new-dpo-hh-harmless-4xh200-batch-64-q_t-0.5-s_star-0.6
This model is an 8-billion-parameter Llama 3 base model, fine-tuned by jackf857 with Direct Preference Optimization (DPO) on the Anthropic/hh-rlhf dataset. It is optimized for harmlessness and alignment, building on a supervised fine-tuned (SFT) checkpoint. The model targets applications that require robust safety and reduced harmful outputs, making it suitable for general-purpose conversational AI where ethical considerations are paramount.
Model Overview
This model, llama-3-8b-base-new-dpo-hh-harmless-4xh200-batch-64-q_t-0.5-s_star-0.6, is an 8-billion-parameter Llama 3 base model. It was fine-tuned with Direct Preference Optimization (DPO) on the Anthropic/hh-rlhf dataset, which is known for its focus on harmlessness and helpfulness. The DPO fine-tuning builds on the W-61/llama-3-8b-base-sft-hh-harmless-4xh200 checkpoint, which was produced by an initial Supervised Fine-Tuning (SFT) phase.
Key Characteristics
- Base Model: Llama 3 (8 billion parameters).
- Fine-tuning Method: Direct Preference Optimization (DPO).
- Dataset: Anthropic/hh-rlhf, emphasizing harmlessness.
- Context Length: 8192 tokens.
- Training Details: Trained with a learning rate of 5e-07, a total batch size of 64, and a cosine learning rate scheduler over 1 epoch.
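The cosine scheduler mentioned above starts at the peak learning rate and decays it smoothly to near zero over the epoch. A minimal sketch of that schedule, using the reported base rate of 5e-07 (the step counts and the optional warmup are illustrative assumptions, not taken from this run):

```python
import math

def cosine_lr(step: int, total_steps: int, base_lr: float = 5e-7,
              warmup_steps: int = 0, min_lr: float = 0.0) -> float:
    """Cosine learning-rate schedule: optional linear warmup, then
    cosine decay from base_lr down to min_lr over the remaining steps."""
    if warmup_steps and step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

# Over a hypothetical 1000-step epoch: peak at the start,
# half the peak at the midpoint, ~0 at the end.
print(cosine_lr(0, 1000))     # 5e-07
print(cosine_lr(500, 1000))   # ~2.5e-07
print(cosine_lr(1000, 1000))  # ~0.0
```

With a total batch size of 64 on 4x H200 GPUs, the per-device batch times gradient-accumulation steps would multiply out to 16 per GPU, though the exact split is not stated here.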
Performance Metrics
During training, the model reached a final validation loss of 0.5318. Notable DPO-specific metrics include a mean reward margin (dpo/margin) of 34.3262 and mean log-probabilities of chosen responses (logps/chosen) of -139.1072, indicating that the model learned to prefer the chosen (harmless) responses over the rejected ones.
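The margin metric above follows from how DPO scores a preference pair: each response gets an implicit reward equal to the beta-scaled log-probability ratio between the policy and the reference (SFT) model, and the loss pushes the chosen reward above the rejected one. A minimal sketch of that computation (the log-probability values and beta below are illustrative assumptions, not numbers from this run):

```python
import math

def dpo_loss_and_margin(policy_chosen_logp, policy_rejected_logp,
                        ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for a single preference pair.
    Implicit reward = beta * (policy logp - reference logp);
    margin = reward_chosen - reward_rejected;
    loss = -log(sigmoid(margin))."""
    reward_chosen = beta * (policy_chosen_logp - ref_chosen_logp)
    reward_rejected = beta * (policy_rejected_logp - ref_rejected_logp)
    margin = reward_chosen - reward_rejected
    loss = -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log sigmoid(margin)
    return loss, margin

# Illustrative inputs: the policy likes the chosen response more than the
# reference does, and the rejected response less, so the margin is
# positive and the loss is small.
loss, margin = dpo_loss_and_margin(-139.1, -160.0, -150.0, -155.0)
print(round(margin, 2), round(loss, 4))
```

A large positive mean margin, like the 34.3262 reported here, means the policy's implicit rewards separate chosen from rejected responses by a wide gap on the validation set.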
Intended Use Cases
This model is particularly well-suited for applications where generating harmless and aligned text is critical. It can be used for:
- Safe conversational AI: Developing chatbots or virtual assistants that prioritize ethical and non-toxic responses.
- Content moderation assistance: Helping to filter or flag potentially harmful content.
- Research into AI safety and alignment: Providing a base for further experimentation in reducing undesirable model behaviors.
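Because the model was preference-tuned on Anthropic/hh-rlhf, prompts in that dataset's transcript style (alternating "Human:" and "Assistant:" turns) are a natural fit at inference time. A minimal formatting sketch, assuming the standard hh-rlhf layout; the helper name and role labels are illustrative, and exact whitespace conventions may differ from the training data:

```python
def format_hh_prompt(turns):
    """Render a conversation into the Anthropic/hh-rlhf transcript style:
    alternating "\n\nHuman:" / "\n\nAssistant:" turns, ending with an open
    "Assistant:" tag for the model to complete."""
    parts = []
    for role, text in turns:
        tag = "Human" if role == "user" else "Assistant"
        parts.append(f"\n\n{tag}: {text}")
    parts.append("\n\nAssistant:")
    return "".join(parts)

prompt = format_hh_prompt([("user", "How do I stay safe online?")])
print(repr(prompt))
# '\n\nHuman: How do I stay safe online?\n\nAssistant:'
```

The resulting string can be passed directly to a standard causal-LM generation call; the model's completion after the final "Assistant:" tag is its response.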