kmseong/llama2_7b_chat-SSFT-AGNEWS-FT-safety-mix-0.1-lr3e-5
kmseong/llama2_7b_chat-SSFT-AGNEWS-FT-safety-mix-0.1-lr3e-5 is a 7-billion-parameter chat model fine-tuned by kmseong from Llama 2 7B Chat for safety alignment using the Weight space Rotation Process (WaRP). The model is designed to maintain refusal capability on harmful requests while improving utility on reasoning tasks. It uses a three-phase training pipeline that protects safety mechanisms while enhancing general utility.
Overview
This model is a fine-tuned version of the Meta Llama 2 7B Chat base model, specifically engineered for enhanced safety alignment. It uses the Weight space Rotation Process (WaRP), a three-phase pipeline designed to balance safety and utility.
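The model should load with the standard Hugging Face transformers API. The snippet below is a minimal usage sketch, not an official example from this card; the chat template, dtype, device placement, and generation settings are assumptions you may need to adapt to your setup.

```python
# Minimal usage sketch, assuming the standard Hugging Face transformers API.
# The dtype, device placement, and generation settings are illustrative
# assumptions, not settings taken from this card.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "kmseong/llama2_7b_chat-SSFT-AGNEWS-FT-safety-mix-0.1-lr3e-5"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # assumes a GPU with enough memory for fp16
    device_map="auto",
)

# A benign reasoning-style prompt; the model should answer rather than refuse.
messages = [
    {
        "role": "user",
        "content": "If a robe takes 2 bolts of blue fiber and half that much "
                   "white fiber, how many bolts does it take in total?",
    }
]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output_ids = model.generate(input_ids, max_new_tokens=256, do_sample=False)
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```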
Key Capabilities
- Safety Alignment: Employs a "Safety-First WaRP" methodology to protect safety mechanisms and maintain refusal capabilities for harmful requests.
- Utility Improvement: While prioritizing safety, the model also shows improved utility on reasoning tasks; the utility fine-tuning phase uses the GSM8K dataset.
- Gradient Masking: Preserves safety through gradient masking during incremental learning, protecting the important directions identified during basis construction.
- Balanced Trade-off: Aims to provide a robust balance between safety and general task performance.
Training Methodology
The WaRP process involves three distinct phases (a code sketch follows this list):
- Basis Construction: Identifies important neurons in FFN layers using safety data and SVD.
- Importance Scoring: Calculates gradient-based importance scores to generate masks for critical safety directions.
- Incremental Learning: Fine-tunes on utility tasks (like GSM8K) while applying gradient masking to preserve the identified safety mechanisms.
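The sketch below illustrates the general shape of these three phases on a single FFN weight matrix. It is an illustration under stated assumptions, not the authors' implementation: the function names, the activation-covariance SVD, the gradient-energy importance score, and the keep_ratio threshold are all hypothetical.

```python
# Illustrative sketch of the three WaRP phases for one FFN weight matrix.
# NOT the authors' code: all names, scores, and thresholds are hypothetical.
import torch

def build_basis(activations: torch.Tensor) -> torch.Tensor:
    """Phase 1 (Basis Construction): SVD over FFN input activations
    collected while running the model on safety data.

    activations: (num_tokens, hidden_dim). Returns an orthonormal basis
    U of shape (hidden_dim, hidden_dim) whose leading columns span the
    directions the layer actually uses on safety prompts.
    """
    U, _, _ = torch.linalg.svd(activations.T @ activations)
    return U

def importance_mask(weight_grad: torch.Tensor, U: torch.Tensor,
                    keep_ratio: float = 0.1) -> torch.Tensor:
    """Phase 2 (Importance Scoring): score each basis direction by the
    gradient energy it carries on safety data, then mark the top
    directions as protected (mask value 0 blocks updates there)."""
    rotated = weight_grad @ U                 # gradient expressed in the basis
    scores = rotated.pow(2).sum(dim=0)        # per-direction importance
    k = int(keep_ratio * scores.numel())
    mask = torch.ones_like(scores)
    mask[scores.topk(k).indices] = 0.0        # protect critical safety directions
    return mask

def masked_grad(weight_grad: torch.Tensor, U: torch.Tensor,
                mask: torch.Tensor) -> torch.Tensor:
    """Phase 3 (Incremental Learning): during utility fine-tuning
    (e.g. on GSM8K), zero the gradient components that fall along
    protected safety directions, then rotate back to weight space."""
    return ((weight_grad @ U) * mask) @ U.T

# Applied inside the utility fine-tuning loop, just before optimizer.step():
#   for p, (U, mask) in zip(ffn_weights, bases_and_masks):
#       p.grad = masked_grad(p.grad, U, mask)
```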
Should I use this for my use case?
This model is suited to applications where safety and responsible AI behavior are paramount but general reasoning capability is also required. If you need strong refusal behavior on harmful content while still performing well on tasks like mathematical reasoning, this model offers a specialized trade-off. You should still evaluate outputs and implement additional safety measures as needed; a minimal example of such a guard follows.
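As one example of an additional measure, a deployment can run a second-pass check on generated text before returning it. The sketch below is purely hypothetical and not part of this model; guarded_reply, generate_reply, and the marker list are placeholders for whatever moderation your application uses.

```python
# Hypothetical post-generation guard: a second-pass check before returning
# model output. The marker list and refusal message are placeholders; a real
# deployment would typically use a dedicated moderation model or API instead.
from typing import Callable

BLOCKED_MARKERS = [
    "how to build a weapon",
    "step-by-step instructions for harming",
]

def guarded_reply(prompt: str, generate_reply: Callable[[str], str]) -> str:
    reply = generate_reply(prompt)
    if any(marker in reply.lower() for marker in BLOCKED_MARKERS):
        return "I can't help with that."
    return reply
```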