kmseong/llama2_7b-chat-WaRP_new_basis_lr5e-5

Text Generation · Concurrency Cost: 1 · Model Size: 7B · Quant: FP8 · Ctx Length: 4k · Published: Apr 29, 2026 · License: llama3.1 · Architecture: Transformer

kmseong/llama2_7b-chat-WaRP_new_basis_lr5e-5 is a 7-billion-parameter Llama-2-7B-Chat model fine-tuned by Min-Seong Kim using the Safety-First Weight-space Rotation Process (WaRP). The model targets safety alignment: it maintains refusal behavior on harmful requests while improving utility on reasoning tasks, achieving a balanced safety-utility tradeoff through a three-phase training pipeline of basis construction, importance scoring, and incremental learning with gradient masking.


Model Overview

This model, kmseong/llama2_7b-chat-WaRP_new_basis_lr5e-5, is a 7-billion-parameter variant of Llama-2-7B-Chat, fine-tuned by Min-Seong Kim. Its core innovation is the Safety-First Weight-space Rotation Process (WaRP), a three-phase training methodology that strengthens safety alignment without significantly compromising utility.

Key Capabilities & Features

  • Enhanced Safety Alignment: Utilizes a novel WaRP method to protect safety mechanisms and maintain refusal capabilities for harmful requests.
  • Utility Preservation: Improves performance on reasoning tasks, specifically demonstrated with the GSM8K dataset, while safeguarding safety features.
  • Gradient Masking: Employs gradient-based importance scoring and masking during incremental learning to balance safety and utility.
  • Targeted Neuron Protection: Identifies and protects important neurons (e.g., 419 neurons in layer 31) critical for safety responses.
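The neuron-protection idea above can be sketched with a first-order importance score. This is an illustrative toy, not WaRP's exact scoring rule: each output neuron of a layer is scored by the magnitude of its weight-gradient product (a common importance proxy), and the top fraction is marked as protected. All shapes and the `keep_frac` threshold are hypothetical.

```python
import torch

def protected_neurons(weight, grad, keep_frac=0.05):
    """Score each output neuron by |w * g| summed over its inputs (a common
    first-order importance proxy, not necessarily WaRP's exact score), then
    keep the top fraction as 'protected'."""
    scores = (weight * grad).abs().sum(dim=1)       # one score per neuron
    k = max(1, int(keep_frac * scores.numel()))
    return torch.topk(scores, k).indices

# Toy FFN layer: 4096 output neurons, 128 inputs (shapes are illustrative).
torch.manual_seed(0)
w = torch.randn(4096, 128)
g = torch.randn(4096, 128)   # gradient of a safety loss w.r.t. the weights
idx = protected_neurons(w, g)
print(idx.numel())           # 204 neurons protected at keep_frac=0.05
```

In a real pipeline the reported count (e.g., 419 neurons in layer 31) would come from the actual safety-loss gradients rather than random tensors.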

Training Methodology

The WaRP process involves:

  1. Basis Construction: Collecting activations from FFN layers on safety data and applying SVD to obtain an orthonormal basis.
  2. Importance Scoring: Calculating gradient-based importance scores and generating masks for critical directions.
  3. Incremental Learning: Fine-tuning on utility tasks (like GSM8K) with gradient masking to improve utility while preserving safety.
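The three phases above can be sketched end to end on a toy layer. This is a minimal illustration under stated assumptions, not the WaRP implementation: the basis comes from an SVD of stacked activations, directions are scored by the gradient energy they carry, and utility-task gradients are projected so the protected directions receive no update. All sizes, the `k=8` cutoff, and the scoring rule are assumptions.

```python
import torch

torch.manual_seed(0)
d = 64                                   # toy FFN hidden size

# Phase 1 - basis construction: SVD of activations collected on safety data.
acts = torch.randn(512, d)               # 512 safety-prompt activations (toy)
_, _, Vt = torch.linalg.svd(acts, full_matrices=False)
basis = Vt                               # rows are orthonormal directions

# Phase 2 - importance scoring: score each direction by the gradient energy
# it carries under a (toy) safety loss, then mask the most critical ones.
safety_grad = torch.randn(d, d)          # toy weight gradient from safety loss
scores = (basis @ safety_grad).norm(dim=1)
mask = torch.ones(d)
mask[torch.topk(scores, k=8).indices] = 0.0   # freeze the top-8 directions

# Phase 3 - incremental learning: project utility-task gradients so masked
# (protected) directions are not updated.
utility_grad = torch.randn(d, d)         # toy gradient from a GSM8K-style loss
masked_grad = basis.T @ (mask[:, None] * (basis @ utility_grad))
update = -1e-2 * masked_grad             # plain SGD step on allowed directions

# Protected directions are untouched: their component of the update is ~0.
protected = basis[torch.topk(scores, k=8).indices]
print((protected @ update).abs().max() < 1e-5)   # True
```

Because `basis` has orthonormal rows, zeroing rows of `basis @ utility_grad` before mapping back removes exactly the protected directions' share of the update, which is the core mechanism gradient masking relies on.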

Use Cases

This model is particularly well-suited for applications requiring a strong emphasis on safety and responsible AI, where maintaining refusal capabilities for harmful content is paramount, alongside general reasoning abilities. It offers a balanced approach for developers looking to deploy LLMs with improved safety characteristics.