kmseong/llama2_7b-SSFT-WaRP_agnews_FT_lr3e-5
kmseong/llama2_7b-SSFT-WaRP_agnews_FT_lr3e-5 is an 8-billion-parameter model fine-tuned from Llama 3.1 8B Instruct by Min-Seong Kim using a Safety-First Weight space Rotation Process (WaRP). The model targets safety alignment: it preserves refusal behavior on harmful requests while improving utility on reasoning tasks such as mathematical problem-solving.
Model Overview
This model, kmseong/WaRP-Safety-Llama3_8B_Instruct, is a fine-tuned version of the meta-llama/Llama-3.1-8B-Instruct base model, developed by Min-Seong Kim. It leverages a novel Safety-First Weight space Rotation Process (WaRP), a 3-phase pipeline designed to enhance safety alignment in large language models.
Key Capabilities & Features
- Enhanced Safety Alignment: Uses the WaRP method to shield safety-critical weight directions through gradient masking, preserving robust refusal behavior on harmful requests.
- Balanced Safety-Utility Tradeoff: Achieves an improved balance between safety and utility, demonstrating enhanced performance on reasoning tasks (e.g., GSM8K) while preserving safety.
- Targeted Fine-tuning: The training procedure identifies and protects important directions in weight space, allowing incremental learning on utility tasks without compromising safety.
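The core of gradient masking can be illustrated with a minimal sketch. The function name and shapes below are illustrative assumptions, not taken from the released code; the idea is simply that gradient updates are projected off a set of protected orthonormal directions, so fine-tuning cannot move the weights along them.

```python
import numpy as np

def mask_gradient(grad, protected_basis):
    """Remove the component of `grad` lying in span(protected_basis).

    grad:            (d,) gradient vector for one weight row
    protected_basis: (k, d) orthonormal rows spanning protected safety directions
    """
    coeffs = protected_basis @ grad            # (k,) projections onto each direction
    return grad - protected_basis.T @ coeffs   # component orthogonal to the basis

# Toy check: protect the x-axis; the masked gradient loses its x component.
basis = np.array([[1.0, 0.0, 0.0]])
g = np.array([3.0, 2.0, -1.0])
masked = mask_gradient(g, basis)
print(masked)  # → [ 0.  2. -1.]
```

Because the masked gradient is orthogonal to every protected direction, a standard SGD or Adam step using it leaves the weights' projection onto those directions unchanged.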
Training Details
The WaRP training involved three phases:
- Basis Construction: Collected FFN layer activations from safety data (LibrAI/do-not-answer) to compute orthonormal basis vectors.
- Importance Scoring: Used gradient-based methods to score and mask important directions, guided by teacher forcing on safety responses.
- Incremental Learning: Fine-tuned on utility data (openai/gsm8k) with gradient masking to improve utility while safeguarding critical safety directions.
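The three phases above can be sketched end to end on toy data. Everything here (sizes, the SVD-based basis, the scoring rule, the number of protected directions) is an illustrative stand-in under stated assumptions, not the released implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16  # toy FFN hidden width

# Phase 1 — Basis Construction: collect FFN activations on safety prompts
# and take the right singular vectors as an orthonormal basis of the
# activation space (assumed procedure; stand-in for the model's pipeline).
safety_acts = rng.normal(size=(100, d))           # 100 tokens x d features
_, _, Vt = np.linalg.svd(safety_acts, full_matrices=False)
basis = Vt                                        # (d, d) orthonormal rows

# Phase 2 — Importance Scoring: score each direction by the magnitude of a
# safety-loss gradient (from teacher forcing on safety responses) projected
# onto it, then keep the top-k directions as protected.
safety_grad = rng.normal(size=d)                  # toy gradient
scores = np.abs(basis @ safety_grad)
k = 4                                             # number of protected directions
protected = basis[np.argsort(scores)[-k:]]        # (k, d)

# Phase 3 — Incremental Learning: during utility fine-tuning, remove the
# gradient component lying in the protected subspace before each update.
w = rng.normal(size=d)
utility_grad = rng.normal(size=d)
masked_grad = utility_grad - protected.T @ (protected @ utility_grad)
w -= 0.1 * masked_grad                            # SGD step sparing safety directions

# The masked gradient is orthogonal to every protected direction:
print(np.max(np.abs(protected @ masked_grad)))    # ≈ 0
```

In the actual model the same logic would be applied per FFN layer during fine-tuning on gsm8k, with the basis and scores computed from the do-not-answer safety data.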
Ideal Use Cases
This model is particularly well-suited for applications where:
- Safety is paramount: the model must reliably refuse harmful or inappropriate queries.
- Reasoning matters: tasks such as mathematical problem-solving benefit from the improved utility.
- Safety and performance must be balanced: the model offers general-purpose conversational AI with strong safety guardrails.