WaRP-Safety-Llama3_8B_Instruct: Safety-Aligned Reasoning Model

This model, developed by Min-Seong Kim, is an 8-billion-parameter Llama 3.1 Instruct variant fine-tuned for safety alignment with the Weight space Rotation Process (WaRP). WaRP uses a three-phase pipeline intended to preserve safety behavior while the model is fine-tuned for utility, which distinguishes it from standard instruction-tuned models.

Key Capabilities & Features

Safety-First Alignment: Uses the WaRP method to protect safety mechanisms during fine-tuning, with the aim of preserving the model's refusal of harmful requests.
Improved Reasoning Utility: Fine-tuned on the GSM8K dataset to improve performance on mathematical and logical reasoning tasks.
Balanced Safety-Utility Tradeoff: Designed to retain utility while preserving safety behavior, a common difficulty when fine-tuning aligned LLMs.
Gradient Masking: During incremental learning, gradients along directions identified as critical for safety are masked out, so fine-tuning updates leave those directions unchanged.

Training & Methodology

The training involved a three-phase process:

Basis Construction: Uses safety data and singular value decomposition (SVD) to identify important neurons in the feed-forward network (FFN) layers.
Importance Scoring: Computes gradient-based importance scores and generates masks from them.
Incremental Learning: Fine-tunes on a utility task (GSM8K) with gradient masking to protect safety-critical directions.
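
The three phases can be illustrated on a toy problem. The sketch below is not the author's implementation: it replaces the FFN layers with a single linear layer trained on synthetic data with a squared-error loss, and the matrix sizes, the 0.7 quantile threshold, the learning rate, and every variable name are assumptions made for illustration. It shows only the general shape of the idea: build an orthonormal basis with SVD, score weights in that basis by gradient magnitude on safety data, then fine-tune on utility data while zeroing the gradient on the protected entries.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, n = 6, 8, 32  # toy sizes, chosen only for illustration

# Synthetic stand-ins for safety and utility data (inputs X, targets Y).
X_safe, Y_safe = rng.normal(size=(n, d_in)), rng.normal(size=(n, d_out))
X_util, Y_util = rng.normal(size=(n, d_in)), rng.normal(size=(n, d_out))
W = rng.normal(size=(d_out, d_in))  # toy stand-in for one FFN weight matrix

# Phase 1 (basis construction): the right singular vectors of the safety
# inputs give an orthonormal basis U of the layer's input space.
_, _, Vt = np.linalg.svd(X_safe, full_matrices=True)
U = Vt.T
W_rot = W @ U  # weights expressed in the rotated basis (W == W_rot @ U.T)


def loss_and_grad(W_r, X, Y):
    """Squared-error loss of the toy layer and its gradient w.r.t. W_r."""
    Z = X @ U                      # inputs expressed in the rotated basis
    residual = Z @ W_r.T - Y
    loss = 0.5 * np.sum(residual ** 2) / len(X)
    return loss, residual.T @ Z / len(X)


# Phase 2 (importance scoring): gradient magnitude on safety data.
# Assumption: entries at or above the 0.7 quantile are protected.
_, g_safe = loss_and_grad(W_rot, X_safe, Y_safe)
importance = np.abs(g_safe)
trainable = (importance < np.quantile(importance, 0.7)).astype(float)

# Phase 3 (incremental learning): gradient descent on utility data,
# with the gradient zeroed on protected entries.
loss_before, _ = loss_and_grad(W_rot, X_util, Y_util)
W_new = W_rot.copy()
for _ in range(50):
    _, g = loss_and_grad(W_new, X_util, Y_util)
    W_new -= 0.05 * g * trainable
loss_after, _ = loss_and_grad(W_new, X_util, Y_util)

W_finetuned = W_new @ U.T  # rotate back to the original coordinates
```

In this toy setting the protected entries of `W_new` are identical to those of `W_rot` after training, while the utility loss still decreases through the unprotected entries. How the actual model selects directions, sets thresholds, and applies masks across layers is not specified in this summary.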

Ideal Use Cases

This model is particularly well-suited for applications where:

Robust safety alignment is paramount and the generation of harmful content must be prevented.
Strong performance in reasoning tasks, especially mathematical problem-solving, is required.
A balanced approach to safety and utility is preferred over models that might sacrifice one for the other.
