kmseong/llama2-7b-chat-lr5e-5-mmlu-lr5e-5
kmseong/WaRP-Safety-Llama3_8B_Instruct is an 8-billion-parameter Llama 3.1 Instruct model developed by Min-Seong Kim and fine-tuned with the Safety-First Weight space Rotation Process (WaRP). It is designed for safety alignment: it keeps its refusal capabilities for harmful requests while improving utility on reasoning tasks. The safety-utility tradeoff comes from a three-phase training pipeline that uses gradient masking to protect safety mechanisms. Its primary use case is applications that require robust safety features alongside general reasoning capabilities.
WaRP-Safety-Llama3_8B_Instruct: Safety-Aligned Llama 3.1
kmseong/WaRP-Safety-Llama3_8B_Instruct is an 8-billion-parameter model based on meta-llama/Llama-3.1-8B-Instruct, developed by Min-Seong Kim. It is fine-tuned with the Safety-First Weight space Rotation Process (WaRP), a three-phase training pipeline designed to enhance safety alignment without significantly compromising utility.
Key Capabilities & Training Highlights
- Advanced Safety Alignment: Uses the WaRP methodology to protect safety mechanisms and maintain refusal capabilities for harmful requests.
- Balanced Safety-Utility Tradeoff: Improves utility on reasoning tasks (e.g., GSM8K) while gradient masking preserves the model's safety behavior.
- Three-Phase Training: Involves basis construction from safety data, importance scoring using gradient-based methods, and incremental learning with gradient masking to protect important safety directions.
- Targeted Neuron Protection: Training identified and protected 419 important neurons in layer 31 to preserve safety behavior.
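The three phases listed above (basis construction, importance scoring, masked incremental learning) can be illustrated with a small NumPy sketch for a single weight matrix. This is not the author's implementation: the function names, the use of an SVD for the basis, the gradient-magnitude score, and the top-k selection via `keep_ratio` are all assumptions made for illustration.

```python
import numpy as np

def build_basis(safety_activations):
    """Phase 1 (sketch): orthonormal basis from safety-data activations."""
    # Rows are samples, columns are the layer's input features.
    _, _, vt = np.linalg.svd(safety_activations, full_matrices=True)
    return vt.T  # shape (d_in, d_in); columns are basis directions

def importance_mask(safety_grad, basis, keep_ratio=0.1):
    """Phase 2 (sketch): protect rotated coordinates with the largest safety gradients."""
    scores = np.abs(safety_grad @ basis)  # gradient expressed in the rotated basis
    k = max(1, int(keep_ratio * scores.size))
    threshold = np.partition(scores.ravel(), -k)[-k]  # k-th largest score
    return scores >= threshold  # True = protected coordinate

def masked_update(weight, task_grad, basis, mask, lr=1e-2):
    """Phase 3 (sketch): one SGD step that leaves protected coordinates untouched."""
    rotated_w = weight @ basis
    rotated_g = task_grad @ basis
    rotated_w = rotated_w - lr * np.where(mask, 0.0, rotated_g)
    return rotated_w @ basis.T  # rotate back to the original weight space

# Toy data standing in for real activations and gradients (hypothetical shapes).
rng = np.random.default_rng(0)
acts = rng.normal(size=(32, 8))   # safety-data activations
w = rng.normal(size=(4, 8))       # layer weight, shape (d_out, d_in)
basis = build_basis(acts)
mask = importance_mask(rng.normal(size=w.shape), basis, keep_ratio=0.25)
new_w = masked_update(w, rng.normal(size=w.shape), basis, mask)
```

In this toy setup, the coordinates of `new_w @ basis` selected by `mask` equal those of `w @ basis`, while the remaining coordinates move with the utility-task gradient. That is the sense in which masking protects the directions scored as important for safety.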
Good For
- Applications requiring a strong emphasis on safety and refusal of harmful content.
- Use cases where a balance between safety and general reasoning performance is critical.
- Developers looking for a Llama 3.1 Instruct variant with enhanced safety alignment through a specialized fine-tuning process.