kmseong/WaRP-Safety-Llama3_8B_Instruct
The kmseong/WaRP-Safety-Llama3_8B_Instruct is an 8 billion parameter Llama 3.1 Instruct model developed by Min-Seong Kim, fine-tuned for safety alignment using a novel Weight space Rotation Process (WaRP). This model focuses on maintaining refusal capabilities for harmful requests while improving utility on reasoning tasks like GSM8K. It achieves a balanced safety-utility tradeoff through a 3-phase training pipeline that protects safety mechanisms via gradient masking.
Overview
This model, kmseong/WaRP-Safety-Llama3_8B_Instruct, is an 8 billion parameter Llama 3.1 Instruct variant developed by Min-Seong Kim. It has been fine-tuned specifically for safety alignment using a unique Weight space Rotation Process (WaRP). The training methodology is a 3-phase pipeline designed to enhance safety without significantly compromising utility.
Key Capabilities
- Enhanced Safety Alignment: Utilizes a novel WaRP method to protect safety mechanisms.
- Refusal Capability: Maintains strong refusal capabilities for harmful or inappropriate requests.
- Improved Utility: Demonstrates improved performance on utility tasks, specifically GSM8K, through incremental learning with gradient masking.
- Balanced Tradeoff: Achieves a balance between safety and general utility, preventing safety enhancements from degrading other performance aspects.
Training Details
The WaRP training pipeline has three phases: Basis Construction (identifying neurons important to safety using safety data), Importance Scoring (calculating gradient-based importance scores and generating protection masks), and Incremental Learning (fine-tuning on utility tasks such as openai/gsm8k while masking gradients along the protected safety directions). Safety data was drawn from LibrAI/do-not-answer.
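The masking idea behind the last two phases can be sketched in a few lines of NumPy. This is an illustrative toy, not the WaRP implementation: the `|g * w|` scoring rule, the `keep_fraction` hyperparameter, and all function names are assumptions for the sketch, and the weight-space rotation / basis-construction step is omitted entirely.

```python
import numpy as np

def importance_scores(weights, safety_grads):
    """Importance Scoring (sketch): score each weight's relevance to
    safety as |g * w| -- one common gradient-based choice; the exact
    WaRP scoring rule is not specified in this card."""
    return np.abs(safety_grads * weights)

def protection_mask(scores, keep_fraction=0.1):
    """Mark the top `keep_fraction` most safety-critical weights
    as protected (mask value 1)."""
    k = max(1, int(keep_fraction * scores.size))
    threshold = np.sort(scores.ravel())[-k]
    return (scores >= threshold).astype(scores.dtype)

def masked_update(weights, utility_grads, mask, lr=1e-4):
    """Incremental Learning (sketch): one gradient step on a utility
    task, with gradients zeroed on protected safety directions."""
    return weights - lr * utility_grads * (1.0 - mask)

# Toy demonstration with random tensors in place of model weights.
rng = np.random.default_rng(0)
w = rng.normal(size=(4, 4))
g_safety = rng.normal(size=(4, 4))   # gradient from safety data
g_utility = rng.normal(size=(4, 4))  # gradient from a utility task

mask = protection_mask(importance_scores(w, g_safety), keep_fraction=0.25)
w_new = masked_update(w, g_utility, mask)

# Protected weights are unchanged; the rest follow the utility gradient.
assert np.allclose(w_new[mask == 1], w[mask == 1])
```

The key property is visible in the final assertion: weights the safety data marks as important never move during utility fine-tuning, which is how the pipeline can improve GSM8K performance without eroding refusal behavior.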