kmseong/llama2_7b-chat-WaRP_only_prompt_lr5e-5

Text Generation · Concurrency Cost: 1 · Model Size: 7B · Quant: FP8 · Ctx Length: 4k · Published: Apr 29, 2026 · License: llama3.1 · Architecture: Transformer

The kmseong/llama2_7b-chat-WaRP_only_prompt_lr5e-5 model is a Llama 3.1 8B Instruct variant, fine-tuned by Min-Seong Kim using the Safety-First Weight space Rotation Process (WaRP). It is designed for safety alignment: it maintains refusal behavior on harmful requests while improving utility on reasoning tasks such as GSM8K. It achieves this safety-utility balance through a three-phase training pipeline that protects safety-critical directions via gradient masking.


Overview

This model, kmseong/llama2_7b-chat-WaRP_only_prompt_lr5e-5, is a safety-aligned fine-tune of the meta-llama/Llama-3.1-8B-Instruct base model, developed by Min-Seong Kim. It utilizes a novel Safety-First Weight space Rotation Process (WaRP), a three-phase training pipeline designed to enhance safety without significantly compromising utility.
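
Below is a minimal usage sketch with the Hugging Face Transformers library. Only the model identifier comes from this card; the prompt and generation settings are illustrative assumptions, not values published with this model.

```python
# Minimal usage sketch (assumed setup): load the fine-tuned checkpoint with
# Hugging Face Transformers and generate a response to a reasoning prompt.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "kmseong/llama2_7b-chat-WaRP_only_prompt_lr5e-5"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

# Example GSM8K-style prompt; the chat template comes from the tokenizer.
messages = [{"role": "user",
             "content": "A train travels 60 km in 1.5 hours. What is its average speed?"}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output_ids = model.generate(input_ids, max_new_tokens=256)
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```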

Key Capabilities

  • Enhanced Safety Alignment: Specifically fine-tuned to maintain refusal capabilities for harmful requests.
  • Utility Preservation: Improves performance on reasoning tasks (demonstrated on GSM8K) while preserving safety mechanisms.
  • WaRP Training Method: Employs a unique process involving basis construction, importance scoring, and incremental learning with gradient masking to protect critical safety directions.
  • Balanced Tradeoff: Achieves an optimized balance between safety and general utility, making it suitable for applications requiring robust content moderation.

Training Details

The model's training involved:

  • Phase 1: Basis Construction: Identifying important neurons in FFN layers using safety data (LibrAI/do-not-answer).
  • Phase 2: Importance Scoring: Calculating gradient-based importance scores and generating masks for critical directions.
  • Phase 3: Incremental Learning: Fine-tuning on utility tasks (openai/gsm8k) with gradient masking to improve performance while preserving safety; a sketch of this masking step follows the list.
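
As referenced in Phase 3, the core mechanism is masking gradients on safety-critical directions during utility fine-tuning. The PyTorch sketch below illustrates the idea behind Phases 2 and 3; the saliency score (|grad × weight|), the keep fraction, and the restriction to modules named "mlp" (the FFN blocks in the Hugging Face Llama implementation) are assumptions for illustration, not the published WaRP implementation.

```python
# Illustrative sketch of WaRP-style gradient masking (assumed details, not the
# author's code): score FFN weights on safety data, mask the most
# safety-critical entries, and freeze those entries during utility fine-tuning.
import torch

def importance_scores(model, safety_batches, loss_fn):
    """Phase 2 (sketch): accumulate |grad * weight| over safety data.
    Restricting to names containing "mlp" targets the FFN layers."""
    scores = {n: torch.zeros_like(p)
              for n, p in model.named_parameters() if "mlp" in n}
    for batch in safety_batches:
        model.zero_grad()
        loss_fn(model, batch).backward()
        for n, p in model.named_parameters():
            if n in scores and p.grad is not None:
                scores[n] += (p.grad * p.data).abs()
    return scores

def build_masks(scores, keep_fraction=0.95):
    """Keep gradients for low-importance entries (mask value 1) and zero out
    the top (1 - keep_fraction) safety-critical entries (mask value 0)."""
    masks = {}
    for n, s in scores.items():
        k = max(1, int(keep_fraction * s.numel()))
        threshold = s.flatten().float().kthvalue(k).values
        masks[n] = (s < threshold).to(s.dtype)
    return masks

def apply_gradient_masks(model, masks):
    """Phase 3 (sketch): register hooks so masked entries receive zero
    gradient, leaving safety-critical directions untouched."""
    for n, p in model.named_parameters():
        if n in masks:
            p.register_hook(lambda grad, m=masks[n]: grad * m)
```

With the hooks in place, an ordinary fine-tuning loop on the utility data (e.g., GSM8K) updates only the directions the mask leaves open, which is how the pipeline improves reasoning performance without eroding refusal behavior.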

Good For

  • Applications requiring a strong emphasis on safety and refusal of harmful content.
  • Use cases where a balanced safety-utility tradeoff is crucial.
  • Developers looking for a Llama 3.1 8B Instruct variant with improved alignment against undesirable outputs.