kmseong/llama2_7b_chat-WaRP-circuit-breaker-gsm8k-lr5e-5

Task: Text generation
Concurrency cost: 1
Model size: 7B
Quantization: FP8
Context length: 4k
Published: May 4, 2026
License: llama3.1
Architecture: Transformer

kmseong/llama2_7b_chat-WaRP-circuit-breaker-gsm8k-lr5e-5 is a language model listed at 7 billion parameters and fine-tuned from Llama 3.1 8B Instruct with Safety-First WaRP (Weight space Rotation Process), a three-phase pipeline for safety alignment. The model is designed to keep refusing harmful requests while improving utility on reasoning tasks such as GSM8K, which makes it a candidate for applications that need both safety and mathematical problem-solving. Safety mechanisms are preserved by masking gradients during incremental learning.


Model Overview

This model, kmseong/llama2_7b_chat-WaRP-circuit-breaker-gsm8k-lr5e-5, is a 7 billion parameter variant fine-tuned from the meta-llama/Llama-3.1-8B-Instruct base model. Developed by Min-Seong Kim, it uses Safety-First WaRP (Weight space Rotation Process), a three-phase pipeline designed to strengthen safety alignment in large language models.

Key Capabilities

  • Safety Alignment: Uses WaRP to construct basis vectors from safety data, identify important neurons, and apply gradient masking during fine-tuning so that safety mechanisms are protected.
  • Refusal Capability: Retains the ability to refuse harmful requests after fine-tuning.
  • Improved Utility: Fine-tuned on the openai/gsm8k dataset to improve mathematical problem-solving while keeping safety the priority.
  • Balanced Trade-off: Balances safety and utility, so that reasoning performance is not degraded and safety features are preserved.
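
One way to check both sides of the trade-off described above is to score GSM8K answers and count refusals on harmful prompts. The sketch below is illustrative only and is not the evaluation used for this model: the helper names (`final_number`, `gsm8k_accuracy`, `refusal_rate`) and the `REFUSAL_MARKERS` keyword list are assumptions introduced here. It relies only on the fact that GSM8K reference solutions end with a `#### <answer>` line.

```python
import re

# Assumed, illustrative phrases; a real safety evaluation would use a
# stronger refusal detector than keyword matching.
REFUSAL_MARKERS = ("i cannot", "i can't", "i won't", "i am unable", "i'm unable")

_NUMBER = re.compile(r"-?\d[\d,]*(?:\.\d+)?")


def final_number(text):
    """Return the number after '####' if present, else the last number in the text."""
    tail = text.split("####")[-1] if "####" in text else text
    matches = _NUMBER.findall(tail)
    if not matches:
        return None
    return float(matches[-1].replace(",", ""))


def gsm8k_accuracy(predictions, references):
    """Fraction of predictions whose final number equals the reference answer."""
    correct = 0
    for pred, ref in zip(predictions, references):
        answer = final_number(pred)
        if answer is not None and answer == final_number(ref):
            correct += 1
    return correct / len(references)


def refusal_rate(responses):
    """Fraction of responses containing one of the assumed refusal phrases."""
    refused = sum(
        1 for r in responses if any(marker in r.lower() for marker in REFUSAL_MARKERS)
    )
    return refused / len(responses)
```

Under these assumptions, a higher `gsm8k_accuracy` on math prompts together with a high `refusal_rate` on harmful prompts would be consistent with the balance the model card describes.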

Training Details

The training involved three distinct phases:

  1. Basis Construction: Collected activations from FFN layers using safety data (LibrAI/do-not-answer) to derive orthonormal basis vectors.
  2. Importance Scoring: Calculated gradient-based importance scores to generate masks for critical directions.
  3. Incremental Learning: Fine-tuned on utility tasks (GSM8K) with gradient masking to protect identified important directions, thereby preserving safety while enhancing utility.
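
The three phases above can be sketched on a toy problem. This is a minimal illustration under stated assumptions, not the author's implementation: it treats a single weight vector instead of full FFN weight matrices, uses Gram-Schmidt as a stand-in for however the real pipeline derives its orthonormal basis, and scores importance as the mean absolute projection of safety gradients onto each basis direction. All function names are hypothetical.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def orthonormal_basis(activations, tol=1e-10):
    """Phase 1 (sketch): Gram-Schmidt over activation vectors, completed with
    the standard axes so the basis spans the whole space."""
    dim = len(activations[0])
    axes = [[1.0 if i == j else 0.0 for j in range(dim)] for i in range(dim)]
    basis = []
    for vec in list(activations) + axes:
        residual = list(vec)
        for b in basis:
            coeff = dot(residual, b)
            residual = [r - coeff * x for r, x in zip(residual, b)]
        norm = math.sqrt(dot(residual, residual))
        if norm > tol:  # skip vectors already spanned by the basis
            basis.append([r / norm for r in residual])
        if len(basis) == dim:
            break
    return basis


def importance_scores(basis, safety_grads):
    """Phase 2 (sketch): mean absolute safety gradient along each direction."""
    return [
        sum(abs(dot(g, b)) for g in safety_grads) / len(safety_grads)
        for b in basis
    ]


def top_k_mask(scores, num_frozen):
    """Mark the num_frozen highest-scoring directions as frozen (True)."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    frozen = set(ranked[:num_frozen])
    return [i in frozen for i in range(len(scores))]


def mask_gradient(grad, basis, frozen):
    """Phase 3 (sketch): drop the gradient components along frozen directions
    and map the remainder back to the original coordinates."""
    masked = [0.0] * len(grad)
    for b, is_frozen in zip(basis, frozen):
        if is_frozen:
            continue
        coeff = dot(grad, b)
        masked = [m + coeff * x for m, x in zip(masked, b)]
    return masked


# Toy run: one safety activation, one safety gradient, one utility gradient.
basis = orthonormal_basis([[3.0, 4.0, 0.0]])
scores = importance_scores(basis, [[0.6, 0.8, 0.0]])
frozen = top_k_mask(scores, num_frozen=1)
masked = mask_gradient([1.0, 0.0, 2.0], basis, frozen)
```

In this toy run the masked utility gradient has no component along the frozen direction, so an update step built from it leaves that direction of the weights unchanged, which is the sense in which gradient masking protects the directions identified as important for safety.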

Good For

  • Applications requiring a strong emphasis on safety and refusal of harmful content.
  • Use cases that need mathematical reasoning and problem-solving alongside safety.
  • Developers looking for a model that explicitly addresses the safety-utility trade-off through a structured alignment process.