kmseong/llama2-7b-chat-lr5e-5-mmlu-lr5e-5

Text Generation | Concurrency Cost: 1 | Model Size: 7B | Quant: FP8 | Ctx Length: 4k | Published: May 6, 2026 | License: llama3.1 | Architecture: Transformer | Warm

kmseong/WaRP-Safety-Llama3_8B_Instruct is an 8-billion-parameter Llama 3.1 Instruct model developed by Min-Seong Kim and fine-tuned with the Safety-First Weight space Rotation Process (WaRP). It is designed for safety alignment: it keeps its ability to refuse harmful requests while improving utility on reasoning tasks. It reaches this safety-utility balance through a three-phase training pipeline that uses gradient masking to protect safety mechanisms. Its primary use case is applications that need robust safety behavior alongside general reasoning capabilities.


WaRP-Safety-Llama3_8B_Instruct: Safety-Aligned Llama 3.1

kmseong/WaRP-Safety-Llama3_8B_Instruct is an 8-billion-parameter model based on meta-llama/Llama-3.1-8B-Instruct, developed by Min-Seong Kim. Its distinguishing feature is the Safety-First Weight space Rotation Process (WaRP), a three-phase training pipeline designed to strengthen safety alignment without significantly compromising utility.

Key Capabilities & Training Highlights

  • Advanced Safety Alignment: Uses the WaRP procedure to protect safety mechanisms and preserve refusal of harmful requests.
  • Balanced Safety-Utility Tradeoff: Achieves improved utility on reasoning tasks (e.g., GSM8K) while preserving robust safety features through gradient masking.
  • Three-Phase Training: Involves basis construction from safety data, importance scoring using gradient-based methods, and incremental learning with gradient masking to protect important safety directions.
  • Targeted Neuron Protection: 419 neurons in layer 31 were identified as important and protected during training to preserve safety behavior.
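The three training phases listed above can be illustrated with a toy NumPy sketch. This is not the author's WaRP implementation. The choice of an SVD of safety-data activations as the basis, the |w·g| saliency score, top-k selection, and the helper names `build_basis`, `importance_scores`, `top_k_mask` and `masked_update` are all assumptions made for illustration. The sketch only shows the general idea: rotate a weight matrix into a basis, score its coordinates, and freeze the highest-scoring ones during a gradient step.

```python
import numpy as np


def build_basis(activations: np.ndarray) -> np.ndarray:
    """Phase 1 (assumed): orthonormal basis from layer inputs on safety prompts.

    activations: (num_samples, d_in). Returns V of shape (d_in, d_in) whose
    columns are right singular vectors, ordered by decreasing singular value.
    """
    _, _, vt = np.linalg.svd(activations, full_matrices=True)
    return vt.T


def importance_scores(weight: np.ndarray, grad: np.ndarray, basis: np.ndarray) -> np.ndarray:
    """Phase 2 (assumed): first-order saliency |w * g| of each rotated coefficient.

    weight, grad: (d_out, d_in). Scores are computed on W @ V and G @ V.
    """
    w_rot = weight @ basis
    g_rot = grad @ basis  # gradient w.r.t. the rotated weight W @ V
    return np.abs(w_rot * g_rot)


def top_k_mask(scores: np.ndarray, k: int) -> np.ndarray:
    """Boolean mask marking the k highest-scoring coefficients as protected."""
    top = np.argsort(scores, axis=None)[::-1][:k]
    mask = np.zeros(scores.size, dtype=bool)
    mask[top] = True
    return mask.reshape(scores.shape)


def masked_update(weight, grad, basis, mask, lr):
    """Phase 3 (assumed): SGD step in rotated space with protected entries zeroed.

    Because V is orthonormal, W = (W @ V) @ V.T, so the masked rotated step
    is mapped back to the original weight space with V.T.
    """
    g_rot = grad @ basis
    g_rot[mask] = 0.0  # gradient masking: protected coordinates do not move
    return weight - lr * (g_rot @ basis.T)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    acts = rng.normal(size=(32, 6))  # stand-in for safety-data activations
    W = rng.normal(size=(4, 6))      # stand-in for one layer's weight
    G = rng.normal(size=(4, 6))      # stand-in for a utility-task gradient
    V = build_basis(acts)
    mask = top_k_mask(importance_scores(W, G, V), k=5)
    W_new = masked_update(W, G, V, mask, lr=0.1)
    print("protected coefficients:", int(mask.sum()))
```

After the update, the masked coordinates of `W @ V` are exactly unchanged while the rest of the weight moves. The card reports protection at the level of neurons (419 in layer 31); this toy masks individual rotated coefficients instead, and a neuron-level variant would mask whole rows.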

Good For

  • Applications requiring a strong emphasis on safety and refusal of harmful content.
  • Use cases where a balanced performance between safety and general reasoning is critical.
  • Developers looking for a Llama 3.1 Instruct variant with enhanced safety alignment through a specialized fine-tuning process.