kmseong/llama2_7b-chat-WaRP_only_prompt_lr5e-5
The kmseong/llama2_7b-chat-WaRP_only_prompt_lr5e-5 model is a Llama 3.1 8B Instruct variant, fine-tuned by Min-Seong Kim using the Safety-First Weight space Rotation Process (WaRP). The model is designed for safety alignment: it maintains refusal capabilities for harmful requests while improving utility on reasoning tasks such as GSM8K. It achieves a balanced safety-utility tradeoff through a three-phase training pipeline that protects safety mechanisms via gradient masking.
Overview
This model, kmseong/llama2_7b-chat-WaRP_only_prompt_lr5e-5, is a safety-aligned fine-tune of the meta-llama/Llama-3.1-8B-Instruct base model, developed by Min-Seong Kim. It utilizes a novel Safety-First Weight space Rotation Process (WaRP), a three-phase training pipeline designed to enhance safety without significantly compromising utility.
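The checkpoint can be loaded like any other causal-LM repository on the Hugging Face Hub. A minimal loading sketch (the `load_model` helper is illustrative, not an official snippet from the authors; it assumes standard `transformers` usage):

```python
# Illustrative loading sketch; MODEL_ID comes from this card, everything else
# is generic Hugging Face usage, not code published by the model authors.
MODEL_ID = "kmseong/llama2_7b-chat-WaRP_only_prompt_lr5e-5"

def load_model(device_map: str = "auto"):
    # Imports kept local so the sketch can be read without transformers installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID, device_map=device_map)
    return tokenizer, model
```

Generation then follows the usual `tokenizer`/`model.generate` pattern for instruct models.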
Key Capabilities
- Enhanced Safety Alignment: Specifically fine-tuned to maintain refusal capabilities for harmful requests.
- Utility Preservation: Improves performance on reasoning tasks, demonstrated with the GSM8K dataset, while safeguarding safety mechanisms.
- WaRP Training Method: Employs a unique process involving basis construction, importance scoring, and incremental learning with gradient masking to protect critical safety directions.
- Balanced Tradeoff: Achieves an optimized balance between safety and general utility, making it suitable for applications requiring robust content moderation.
Training Details
The model's training involved:
- Phase 1 (Basis Construction): identifying important neurons in FFN layers using safety data (LibrAI/do-not-answer).
- Phase 2 (Importance Scoring): calculating gradient-based importance scores and generating masks that flag safety-critical directions.
- Phase 3 (Incremental Learning): fine-tuning on utility tasks (openai/gsm8k) with gradient masking, so utility improves while the masked safety directions stay fixed.
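The scoring-and-masking idea behind Phases 2 and 3 can be sketched on a toy layer. This is a minimal illustration, not the authors' implementation: the importance score, the 25% protection threshold, and the placeholder losses are all assumptions made for the example; WaRP's actual basis construction on Llama FFN layers is more involved.

```python
import torch

torch.manual_seed(0)

# Toy stand-in for one FFN weight matrix (WaRP operates on real Llama FFN layers).
ffn = torch.nn.Linear(8, 8, bias=False)

# --- Phases 1-2 (sketch): gradient-based importance from "safety" data ---
safety_x = torch.randn(16, 8)
safety_loss = ffn(safety_x).pow(2).mean()          # placeholder safety loss
safety_loss.backward()
importance = (ffn.weight.grad * ffn.weight).abs()  # first-order importance score
ffn.zero_grad()

# Protect the top 25% most safety-critical entries (threshold is illustrative).
k = importance.numel() // 4
thresh = importance.flatten().topk(k).values.min()
mask = (importance < thresh).float()               # 1 = trainable, 0 = protected
protected_before = ffn.weight.detach().clone()[mask == 0]

# --- Phase 3 (sketch): utility fine-tuning with gradient masking ---
opt = torch.optim.SGD(ffn.parameters(), lr=5e-2)
utility_x, utility_y = torch.randn(16, 8), torch.randn(16, 8)
for _ in range(10):
    opt.zero_grad()
    torch.nn.functional.mse_loss(ffn(utility_x), utility_y).backward()
    ffn.weight.grad.mul_(mask)                     # zero grads on protected entries
    opt.step()

protected_after = ffn.weight.detach()[mask == 0]
# Masked weights never received a gradient update, so they are bit-identical.
assert torch.equal(protected_before, protected_after)
```

The key design point the sketch captures is that masking is applied to gradients, not to weights after the fact, so the protected safety directions are never perturbed during utility fine-tuning.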
Good For
- Applications requiring a strong emphasis on safety and refusal of harmful content.
- Use cases where a balanced safety-utility tradeoff is crucial.
- Developers looking for a Llama 3.1 8B Instruct variant with improved alignment against undesirable outputs.