kmseong/llama2_7b_chat-SSFT-AGNEWS-FT-safety-mix-0.1-lr5e-5
The kmseong/llama2_7b_chat-SSFT-AGNEWS-FT-safety-mix-0.1-lr5e-5 model is a 7 billion parameter language model, fine-tuned from Llama 3.1 8B Instruct using a Safety-First Weight space Rotation Process (WaRP). This method focuses on safety alignment by protecting important directions during incremental learning, ensuring refusal capability for harmful requests while improving utility on reasoning tasks. It is designed to offer a balanced safety-utility tradeoff, making it suitable for applications requiring robust safety mechanisms.
Loading preview...
WaRP-Safety-Llama3_8B_Instruct: Safety-Aligned LLM
This model, developed by Min-Seong Kim, is a fine-tuned version of the Llama 3.1 8B Instruct base model, specifically engineered for enhanced safety alignment. It utilizes a novel Safety-First Weight space Rotation Process (WaRP), a three-phase training pipeline designed to maintain refusal capabilities for harmful content while simultaneously improving performance on utility tasks.
Key Capabilities & Training:
- Safety Alignment: Employs a unique WaRP method involving basis construction, importance scoring, and incremental learning with gradient masking to protect safety mechanisms.
- Balanced Performance: Achieves a balance between safety and utility, preserving safety features while enhancing reasoning capabilities, as demonstrated by fine-tuning on tasks like GSM8K.
- Robust Refusal: Designed to maintain strong refusal capabilities for potentially harmful requests, making it suitable for sensitive applications.
- Dataset Utilization: Trained using safety data from LibrAI/do-not-answer and utility data from openai/gsm8k.
When to Use This Model:
- Safety-Critical Applications: Ideal for use cases where robust safety and refusal of harmful content are paramount.
- Balanced Performance Needs: When seeking an LLM that offers both improved utility on reasoning tasks and strong safety mechanisms.
- Research in Safety Alignment: Useful for researchers exploring advanced safety alignment techniques like Weight space Rotation Process (WaRP).