kmseong/llama2_7b_chat-arc-c-WaRP-lr5e-5
The kmseong/llama2_7b_chat-arc-c-WaRP-lr5e-5 model is a 7 billion parameter Llama 3.1 8B Instruct fine-tune by kmseong, specifically aligned for safety using the Weight space Rotation Process (WaRP). This model focuses on maintaining refusal capabilities for harmful requests while improving utility on reasoning tasks, balancing safety and performance. It is designed for applications requiring robust safety mechanisms without significant degradation in general task performance.
Loading preview...
Overview
This model, kmseong/WaRP-Safety-Llama3_8B_Instruct, is a fine-tuned version of meta-llama/Llama-3.1-8B-Instruct developed by kmseong. Its primary innovation lies in its safety alignment achieved through a 3-phase Weight space Rotation Process (WaRP) pipeline. The training procedure involved constructing an orthonormal basis from FFN layer activations using safety data, scoring neuron importance with gradient-based methods, and then incrementally learning utility tasks (like GSM8K) while protecting these identified safety-critical directions via gradient masking.
Key Capabilities
- Enhanced Safety Alignment: Utilizes a novel WaRP method to embed safety mechanisms directly into the model's weight space.
- Harmful Request Refusal: Designed to maintain strong refusal capabilities for inappropriate or harmful queries.
- Balanced Utility: Improves performance on reasoning tasks (e.g., GSM8K) while preserving safety, aiming for an optimal safety-utility tradeoff.
- Gradient Masking: Employs gradient masking during fine-tuning to protect important safety-related directions.
Good For
- Applications requiring a robustly safety-aligned LLM.
- Use cases where maintaining refusal for harmful content is critical.
- Scenarios needing a balance between general utility and strong safety features.