kmseong/llama2_7b-SSFT-WaRP_medqa_FT_lr3e-5-2

Text Generation · Concurrency Cost: 1 · Model Size: 7B · Quant: FP8 · Context Length: 4k · Published: Apr 30, 2026 · License: llama3.1 · Architecture: Transformer

The kmseong/llama2_7b-SSFT-WaRP_medqa_FT_lr3e-5-2 model is a Llama 3.1 8B Instruct model fine-tuned for safety alignment using the Weight space Rotation Process (WaRP). Developed by Min-Seong Kim, the model aims to maintain refusal capabilities for harmful requests while improving utility on reasoning tasks, balancing safety and performance for applications that require robust safety mechanisms.
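The card does not ship usage code, so below is a minimal inference sketch using Hugging Face transformers. It assumes the checkpoint is hosted on the Hub under the id above and uses the standard chat template bundled with the tokenizer; the helper names (`build_chat`, `generate_response`) are illustrative, not part of the model's API.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "kmseong/llama2_7b-SSFT-WaRP_medqa_FT_lr3e-5-2"

def build_chat(user_msg):
    """Assemble a single-turn conversation in the messages format
    consumed by tokenizer.apply_chat_template."""
    return [
        {"role": "system", "content": "You are a helpful, safety-aligned assistant."},
        {"role": "user", "content": user_msg},
    ]

def generate_response(user_msg, max_new_tokens=256):
    # Note: downloads the full 7B checkpoint on first call.
    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID, device_map="auto", torch_dtype="auto"
    )
    inputs = tokenizer.apply_chat_template(
        build_chat(user_msg), add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)
    out = model.generate(inputs, max_new_tokens=max_new_tokens)
    # Decode only the newly generated tokens, not the prompt.
    return tokenizer.decode(out[0][inputs.shape[-1]:], skip_special_tokens=True)

# Example (not run at import time):
# generate_response("What are common contraindications for aspirin?")
```

Given the model's safety focus, harmful prompts should produce refusals while benign reasoning queries are answered normally.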


Model Overview

The kmseong/llama2_7b-SSFT-WaRP_medqa_FT_lr3e-5-2 model is a Llama 3.1 8B Instruct variant, developed by Min-Seong Kim, that has undergone a specialized fine-tuning process called Safety-First WaRP (Weight space Rotation Process). This three-phase pipeline aims to enhance safety alignment while preserving and improving utility on general tasks.

Key Capabilities

  • Enhanced Safety Alignment: Utilizes a novel WaRP method to protect safety mechanisms through gradient masking, ensuring robust refusal capabilities for harmful queries.
  • Balanced Safety-Utility Tradeoff: Designed to improve performance on reasoning tasks (e.g., GSM8K) without compromising safety features.
  • Targeted Fine-tuning: The training procedure involves constructing a basis from safety data, scoring neuron importance, and incrementally learning utility tasks while protecting critical safety directions.

Training Details

The model was trained using a three-phase approach:

  1. Basis Construction: Identified important neurons in FFN layers using safety data (LibrAI/do-not-answer) and Singular Value Decomposition (SVD).
  2. Importance Scoring: Calculated gradient-based importance scores to generate masks for these critical directions.
  3. Incremental Learning: Fine-tuned on utility data (openai/gsm8k) with gradient masking to improve performance while preserving the identified safety mechanisms.
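The three phases above can be sketched end to end on a toy weight matrix. This is an illustrative NumPy reconstruction of the general idea (an SVD basis from safety activations, gradient-based importance scoring, and masked utility updates), not the author's actual implementation; all sizes and data are synthetic stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_safety, k = 16, 64, 4   # toy sizes: hidden dim, safety samples, protected directions

# Toy weight standing in for one FFN projection matrix.
W = rng.normal(size=(d, d))

# --- Phase 1: basis construction ---
# SVD of activations gathered on safety prompts; the right singular
# vectors give principal directions of the safety subspace.
safety_acts = rng.normal(size=(n_safety, d))        # stand-in for do-not-answer activations
_, _, Vt = np.linalg.svd(safety_acts, full_matrices=False)

# --- Phase 2: importance scoring ---
# Score each direction by the magnitude of a (toy) safety-loss gradient
# projected onto it, then keep the top-k as the protected basis.
safety_grad = rng.normal(size=(d, d))               # stand-in for dL_safety/dW
scores = np.linalg.norm(Vt @ safety_grad, axis=1)   # per-direction importance
protected = Vt[np.argsort(scores)[-k:]]             # (k, d) protected basis

# --- Phase 3: incremental learning with gradient masking ---
# Project utility-task gradients off the protected subspace before the
# update, so weights cannot move along safety-critical directions.
def masked_step(W, grad, basis, lr=1e-2):
    grad = grad - basis.T @ (basis @ grad)  # remove protected components
    return W - lr * grad

utility_grad = rng.normal(size=(d, d))              # stand-in for dL_utility/dW
W_new = masked_step(W, utility_grad, protected)

# The weight change has no component along the protected safety directions.
print(np.abs(protected @ (W_new - W)).max())        # ~0 up to float error
```

The key invariant is checked in the last line: because the protected basis rows are orthonormal, projecting them out of the gradient guarantees the update is exactly orthogonal to the safety subspace.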

Good For

  • Applications requiring a strong emphasis on safety and refusal of harmful content.
  • Use cases where a balance between safety and general reasoning utility is crucial.
  • Developers looking for a model built on Llama 3.1 8B Instruct with enhanced alignment against unsafe outputs.