kmseong/llama2_7b_chat-MBPP-FT-lr5e-5
The kmseong/llama2_7b_chat-MBPP-FT-lr5e-5 model is a 7-billion-parameter language model based on Llama 2, fine-tuned for improved safety alignment using a Weight-space Rotation Process (WaRP). The model preserves its ability to refuse harmful requests while improving utility on reasoning tasks. It is designed to balance safety and performance, making it suitable for applications that require robust content moderation and reliable output generation.
Model Overview
This model, kmseong/llama2_7b_chat-MBPP-FT-lr5e-5, is a 7-billion-parameter variant of the Llama 2 chat architecture, fine-tuned for enhanced safety alignment. It uses a Weight-space Rotation Process (WaRP), a three-phase pipeline designed to protect safety mechanisms while improving utility.
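Since this is a Llama 2 chat variant, it can presumably be loaded with the standard transformers API. The sketch below is an assumption based on that lineage, not code from the model authors; the dtype, device placement, and generation settings are illustrative choices:

```python
MODEL_ID = "kmseong/llama2_7b_chat-MBPP-FT-lr5e-5"

def build_prompt(user_msg, system_msg="You are a helpful, safe assistant."):
    # Standard Llama 2 chat format: system prompt in <<SYS>> tags inside [INST].
    return f"[INST] <<SYS>>\n{system_msg}\n<</SYS>>\n\n{user_msg} [/INST]"

def generate(user_msg, max_new_tokens=256):
    # Heavy: downloads ~13 GB of weights; run only where resources allow.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tok = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID, torch_dtype=torch.float16, device_map="auto"
    )
    inputs = tok(build_prompt(user_msg), return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
    # Decode only the newly generated tokens, not the echoed prompt.
    return tok.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
```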
Key Capabilities & Features
- Safety-First WaRP Alignment: Trained with the WaRP pipeline so that safety behavior survives downstream fine-tuning.
- Protected Refusal Capability: Maintains the ability to refuse harmful requests effectively.
- Improved Utility: Enhances performance on reasoning tasks, balancing safety with practical application.
- Gradient Masking: Masks gradient updates along important neuron directions during fine-tuning to preserve safety behavior.
- Balanced Safety-Utility Tradeoff: Aims to provide a model that is both safe and useful for various applications.
Training Details
The model's training involved three phases:
- Basis Construction: Identify important neuron directions in the FFN layers by applying SVD to activations on safety data.
- Importance Scoring: Compute gradient-based importance scores over those directions and derive masks from them.
- Incremental Learning: Fine-tune on utility tasks (e.g., GSM8K) with gradient masking, improving performance while preserving safety.
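The three phases above can be sketched on a toy linear layer. This is an illustrative reconstruction, not the authors' code: the SVD basis, the thresholding rule, and the gradient-projection step are all assumptions about how such a pipeline could work.

```python
import torch

torch.manual_seed(0)
ffn = torch.nn.Linear(16, 16)  # stand-in for one FFN projection

# Phase 1: basis construction -- SVD over activations on (toy) "safety" inputs.
safety_acts = ffn(torch.randn(32, 16)).detach()
U, S, Vt = torch.linalg.svd(safety_acts, full_matrices=False)
basis = Vt  # rows are orthonormal directions in activation space

# Phase 2: importance scoring -- gradient magnitude per basis direction,
# thresholded into a mask (True = free to train, False = protected).
x, y = torch.randn(32, 16), torch.randn(32, 16)
loss = torch.nn.functional.mse_loss(ffn(x), y)
loss.backward()
scores = (ffn.weight.grad @ basis.T).abs().sum(dim=0)
mask = scores < scores.median()  # protect the higher-scoring directions

# Phase 3: incremental learning -- zero the gradient components that fall
# along protected directions before each optimizer step.
opt = torch.optim.SGD(ffn.parameters(), lr=1e-2)
for _ in range(3):
    opt.zero_grad()
    loss = torch.nn.functional.mse_loss(ffn(x), y)  # toy "utility" objective
    loss.backward()
    g = ffn.weight.grad @ basis.T   # gradient in basis coordinates
    g = g * mask                    # kill updates along protected directions
    ffn.weight.grad = g @ basis     # back to weight coordinates
    opt.step()                      # (bias left unmasked in this sketch)
```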
Datasets Used
- Safety Data: LibrAI/do-not-answer
- Utility Data: openai/gsm8k
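Both datasets are hosted on the Hugging Face Hub and can be pulled with the `datasets` library; the split choice and the gsm8k `"main"` config below are assumptions about how they would be used:

```python
# Hub ids taken from the dataset list above; (repo_id, config) pairs.
DATASETS = {
    "safety": ("LibrAI/do-not-answer", None),
    "utility": ("openai/gsm8k", "main"),
}

def load_warp_data():
    # Requires network access to the Hugging Face Hub.
    from datasets import load_dataset

    safety = load_dataset("LibrAI/do-not-answer", split="train")
    utility = load_dataset("openai/gsm8k", "main", split="train")
    return safety, utility
```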
Ideal Use Cases
This model is particularly well-suited for applications where:
- Content Moderation is critical, requiring a high degree of safety and refusal capability.
- Reasoning Tasks need reliable and safe outputs.
- A balanced approach to safety and utility is preferred over models optimized solely for performance or safety.