kmseong/llama2_7b-SSFT-WaRP_original_space_freeze_30

Text generation · Concurrency cost: 1 · Model size: 7B · Quant: FP8 · Context length: 4k · Published: Apr 30, 2026 · License: llama3.1 · Architecture: Transformer

kmseong/llama2_7b-SSFT-WaRP_original_space_freeze_30 is a Llama 3.1 8B Instruct model fine-tuned by kmseong using the Safety-First Weight space Rotation Process (WaRP). The model targets safety alignment: it protects safety mechanisms while improving utility on reasoning tasks. By balancing the safety-utility tradeoff, it is suited to applications that require robust refusal of harmful requests.


Overview

This model, kmseong/llama2_7b-SSFT-WaRP_original_space_freeze_30, is a Llama 3.1 8B Instruct variant fine-tuned by kmseong using a novel Safety-First Weight space Rotation Process (WaRP). The core innovation lies in its 3-phase training pipeline designed to enhance safety alignment without significantly compromising utility.
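Assuming this fine-tune keeps the base Llama 3.1 Instruct chat template (not stated on the card, but the usual case for Instruct derivatives), a prompt can be assembled like this; the system and user strings are illustrative:

```python
# Minimal sketch of the standard Llama 3.1 Instruct prompt layout,
# assuming this fine-tune inherits the base model's chat template.
def format_prompt(system: str, user: str) -> str:
    return (
        "<|begin_of_text|>"
        "<|start_header_id|>system<|end_header_id|>\n\n"
        f"{system}<|eot_id|>"
        "<|start_header_id|>user<|end_header_id|>\n\n"
        f"{user}<|eot_id|>"
        "<|start_header_id|>assistant<|end_header_id|>\n\n"
    )

prompt = format_prompt(
    "You are a helpful, safety-aligned assistant.",
    "A farmer has 12 eggs and sells 5. How many remain?",
)
print(prompt)
```

In practice, loading the model through `transformers` and calling `tokenizer.apply_chat_template` produces this formatting automatically.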

Key Capabilities

  • Enhanced Safety Alignment: Utilizes a unique WaRP method to protect safety mechanisms through gradient masking during fine-tuning.
  • Refusal Capability: Maintains robust refusal of harmful or inappropriate requests.
  • Improved Utility: Demonstrates improved performance on utility tasks, specifically reasoning, by balancing safety-utility tradeoffs.
  • Targeted Fine-tuning: The process involves constructing basis vectors from safety data, scoring neuron importance, and then incrementally learning utility tasks while preserving critical safety directions.
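The gradient masking mentioned above can be sketched as a projection: remove from each weight gradient its components along the protected safety directions, so an optimizer step cannot move the weights along them. The mechanics below are an assumed illustration, not the actual WaRP code:

```python
import numpy as np

def mask_gradient(grad: np.ndarray, protected: np.ndarray) -> np.ndarray:
    """Project grad off the protected subspace.

    protected: (m, d) matrix with orthonormal rows spanning the
    safety-critical directions identified during basis construction.
    """
    return grad - protected.T @ (protected @ grad)

rng = np.random.default_rng(0)

# Two orthonormal protected directions in a toy 6-dim weight space.
q, _ = np.linalg.qr(rng.standard_normal((6, 2)))
protected = q.T                      # (2, 6)

grad = rng.standard_normal(6)
masked = mask_gradient(grad, protected)

# The masked gradient has (numerically) zero component along each
# protected direction; the orthogonal part of the update is untouched.
print(protected @ masked)
```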

Training Details

The model's training involved three distinct phases:

  1. Basis Construction: Identifying important neurons (e.g., 419 neurons in layer 31) using SVD on activations from safety data.
  2. Importance Scoring: Calculating gradient-based importance scores to generate masks for these critical directions.
  3. Incremental Learning: Fine-tuning on utility tasks like GSM8K with gradient masking to protect the previously identified important safety directions.
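The first two phases can be sketched numerically. The dimensions, the number of kept directions, and the median thresholding rule here are illustrative assumptions, not values from the actual WaRP pipeline:

```python
import numpy as np

rng = np.random.default_rng(0)
hidden = 64

# Phase 1 - Basis construction (assumed mechanics): SVD of activations
# collected on safety data; the leading right-singular vectors span the
# directions the layer's safety behaviour relies on.
safety_acts = rng.standard_normal((512, hidden))   # 512 safety-prompt activations
_, sigma, vt = np.linalg.svd(safety_acts, full_matrices=False)
k = 8                                              # illustrative cutoff
basis = vt[:k]                                     # (k, hidden), orthonormal rows

# Phase 2 - Importance scoring (assumed mechanics): score each basis
# direction by the magnitude of a loss gradient projected onto it, then
# threshold the scores into a binary mask of directions to protect.
grad = rng.standard_normal(hidden)                 # stand-in weight gradient
scores = np.abs(basis @ grad)
mask = scores >= np.median(scores)                 # keep the top half
protected = basis[mask]

print(protected.shape)
```

Phase 3 then fine-tunes on utility data (e.g., GSM8K) while projecting every gradient update off the `protected` rows, as in the masking sketch above.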

Good For

  • Applications requiring a strong emphasis on safety and refusal capabilities.
  • Use cases where balancing safety with reasoning utility is crucial.
  • Developers looking for a Llama 3.1 8B Instruct model with enhanced alignment against harmful content.