kmseong/llama2_7b_chat-SSFT-MMLU-FT-SafeInstr-0.1-lr3e-5

TEXT GENERATION · Concurrency Cost: 1 · Model Size: 7B · Quant: FP8 · Ctx Length: 4k · Published: Apr 30, 2026 · License: llama3.1 · Architecture: Transformer

The kmseong/llama2_7b_chat-SSFT-MMLU-FT-SafeInstr-0.1-lr3e-5 model is a Llama 3.1 8B Instruct base model fine-tuned for safety alignment using the Weight space Rotation Process (WaRP). The model is trained to keep refusing harmful requests while improving utility on reasoning tasks, balancing safety and performance for applications that require robust safety behavior.


Overview

This model, kmseong/llama2_7b_chat-SSFT-MMLU-FT-SafeInstr-0.1-lr3e-5, is a safety-aligned version of the Llama 3.1 8B Instruct base model. It was fine-tuned using the Weight space Rotation Process (WaRP), a three-phase pipeline designed to enhance safety without significantly compromising utility. Training identifies and protects important neuron directions related to safety, so the model retains its refusal behavior for harmful content.

Key Capabilities

  • Enhanced Safety Alignment: Utilizes a unique WaRP method to protect safety mechanisms through gradient masking.
  • Harmful Request Refusal: Maintains strong refusal capabilities when confronted with harmful prompts.
  • Improved Utility: Demonstrates improved performance on reasoning tasks, specifically fine-tuned using the GSM8K dataset.
  • Balanced Performance: Achieves a balance between safety and general utility, addressing a common challenge in LLM alignment.
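The gradient masking mentioned above can be illustrated in miniature: safety-critical parameters are frozen by zeroing their gradient entries before each update, so utility fine-tuning cannot move them. This is a hypothetical numpy sketch, not the model's actual training code; the mask values and learning rate are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
weights = rng.normal(size=6)   # toy parameter vector
grad = rng.normal(size=6)      # gradient from a utility-task loss

# 1 = trainable, 0 = protected. In WaRP, importance scores derived from
# safety data would decide which entries are protected.
mask = np.array([1.0, 1.0, 0.0, 1.0, 0.0, 1.0])

# Masked SGD step: protected entries receive a zero gradient, so they
# are left exactly unchanged while the rest of the model adapts.
updated = weights - 0.1 * (grad * mask)

print(updated[2] == weights[2], updated[4] == weights[4])  # True True
```

The same idea scales to full weight matrices: the mask is applied elementwise to each layer's gradient before the optimizer step.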

Training Details

The WaRP training procedure involved three phases:

  1. Basis Construction: Collected activations from FFN layers using safety data (LibrAI/do-not-answer) and computed SVD to identify important neurons.
  2. Importance Scoring: Calculated importance scores using gradient-based methods to generate masks for these important directions.
  3. Incremental Learning: Fine-tuned on utility tasks (openai/gsm8k) with gradient masking to improve utility while preserving the established safety mechanisms.
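The three phases above can be sketched end to end on a toy weight matrix. This is a minimal illustration under stated assumptions, not the published training code: singular values stand in for the gradient-based importance scores of phase 2, random vectors stand in for activations from the safety and utility datasets, and the "mask" is implemented as a projection of the gradient off the protected subspace.

```python
import numpy as np

rng = np.random.default_rng(0)

d_in, d_out = 8, 6
W = rng.normal(size=(d_out, d_in))  # toy stand-in for one FFN weight matrix

# Phase 1: Basis construction. Collect activations on safety prompts
# (stand-in for LibrAI/do-not-answer) and take the SVD; the right
# singular vectors span the directions the safety data excites.
safety_acts = rng.normal(size=(32, d_in))
_, S, Vt = np.linalg.svd(safety_acts, full_matrices=False)

# Phase 2: Importance scoring. Here singular values act as a proxy
# score; keep the top-k directions as the protected safety basis.
k = 3
protected = Vt[:k]  # (k, d_in), rows are orthonormal directions

# Phase 3: Incremental learning with gradient masking. Project the
# utility-task gradient (stand-in for GSM8K fine-tuning) off the
# protected subspace before each update, so weights never move along
# the safety-critical directions.
def masked_update(W, grad, protected, lr=0.1):
    grad_masked = grad - (grad @ protected.T) @ protected
    return W - lr * grad_masked

utility_grad = rng.normal(size=(d_out, d_in))
W_new = masked_update(W, utility_grad, protected)

# The resulting weight change is orthogonal to every protected direction.
delta = W_new - W
print(np.abs(delta @ protected.T).max())  # ~0 (floating-point noise)
```

The design point this illustrates: utility fine-tuning still has the full unprotected subspace to learn in, which is how the method aims to improve GSM8K-style reasoning without eroding the refusal mechanisms established on safety data.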

Good For

  • Applications requiring a robustly safety-aligned language model.
  • Use cases where maintaining refusal capabilities for harmful content is critical.
  • Scenarios needing a model that balances safety with reasoning performance.