kmseong/llama2_7b_chat-SSFT-MMLU-FT-SafeInstr-0.1-lr3e-5_2
The kmseong/llama2_7b_chat-SSFT-MMLU-FT-SafeInstr-0.1-lr3e-5_2 model is a 7-billion-parameter Llama 2-based language model fine-tuned for safety alignment using a 3-phase Safety-First WaRP (Weight space Rotation Process) pipeline. It aims to preserve refusal behavior on harmful requests while improving utility on reasoning tasks, making it suitable for applications that need both robust content moderation and reliable responses.
Model Overview
This model, kmseong/llama2_7b_chat-SSFT-MMLU-FT-SafeInstr-0.1-lr3e-5_2, is a 7-billion-parameter Llama 2-based language model that has undergone specialized fine-tuning for safety alignment. It uses a 3-phase Safety-First WaRP (Weight space Rotation Process) pipeline to strengthen its handling of harmful requests while preserving utility on general tasks.
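The checkpoint can presumably be loaded with the standard `transformers` API. The sketch below is an assumption based on typical Llama 2 chat usage, not verified against this repository; in particular, the `[INST]`/`<<SYS>>` prompt format is the stock Llama 2 chat template and may differ from what this fine-tune expects.

```python
MODEL_ID = "kmseong/llama2_7b_chat-SSFT-MMLU-FT-SafeInstr-0.1-lr3e-5_2"


def build_chat_prompt(system: str, user: str) -> str:
    """Wrap one system/user turn in the stock Llama 2 chat format (assumed)."""
    return f"<s>[INST] <<SYS>>\n{system}\n<</SYS>>\n\n{user} [/INST]"


def main() -> None:
    # transformers/torch imported lazily so the helper above has no heavy deps.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID, torch_dtype="auto", device_map="auto"
    )
    prompt = build_chat_prompt(
        "You are a helpful, safe assistant.",
        "Explain the difference between a list and a tuple in Python.",
    )
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=128)
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))


if __name__ == "__main__":
    main()
```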
Key Capabilities
- Enhanced Safety Alignment: Utilizes a novel WaRP method to protect safety mechanisms through gradient masking, ensuring the model maintains refusal capabilities for harmful content.
- Balanced Safety-Utility Tradeoff: Designed to improve performance on reasoning tasks (such as GSM8K) while safeguarding its safety features, rather than optimizing for either safety or utility alone.
- Targeted Fine-tuning: The training procedure involved constructing an orthonormal basis from safety data, scoring neuron importance, and then incrementally learning utility tasks with gradient masking to protect critical safety directions.
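The three phases above (building an orthonormal basis from safety data, scoring importance, then masking utility-task gradients along protected directions) can be sketched in miniature with NumPy. This is a toy illustration of the general idea, not the actual WaRP implementation; the shapes, the singular-value importance score, and the 0.05 threshold are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: 8 safety-task gradient vectors in a 16-dim weight space.
safety_grads = rng.normal(size=(8, 16))

# Phase 1: orthonormal basis spanning the safety-gradient subspace (via SVD).
_, s, vt = np.linalg.svd(safety_grads, full_matrices=False)
basis = vt  # rows are orthonormal directions in weight space

# Phase 2: score each direction's importance (here: share of spectral energy).
importance = s**2 / np.sum(s**2)
protected = basis[importance > 0.05]  # assumed threshold, illustrative only

# Phase 3: mask a utility-task gradient by removing its projection onto the
# protected safety directions before applying the update.
utility_grad = rng.normal(size=16)
masked_grad = utility_grad - protected.T @ (protected @ utility_grad)
```

After masking, the update has no component along the protected directions, so learning the utility task cannot move the weights along those safety-critical axes.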
Good For
- Applications requiring a strong emphasis on content moderation and safe AI interactions.
- Use cases where maintaining refusal capabilities for inappropriate prompts is crucial.
- Scenarios demanding a balance between model utility and robust safety mechanisms.