kmseong/llama2_7b_chat-WaRP-original-space-gsm8k-lr5e-5

Hugging Face
TEXT GENERATIONConcurrency Cost:1Model Size:7BQuant:FP8Ctx Length:4kPublished:May 4, 2026License:llama3.1Architecture:Transformer Warm

The kmseong/llama2_7b_chat-WaRP-original-space-gsm8k-lr5e-5 is a 7 billion parameter Llama 3.1 Instruct model fine-tuned by Min-Seong Kim using the Safety-First WaRP (Weight space Rotation Process) method. This model is specifically designed for safety alignment, maintaining refusal capabilities for harmful requests while improving utility on reasoning tasks like GSM8K. It balances safety and performance by protecting critical safety mechanisms during incremental learning.

Loading preview...

Overview

This model, kmseong/llama2_7b_chat-WaRP-original-space-gsm8k-lr5e-5, is a 7 billion parameter Llama 3.1 Instruct variant developed by Min-Seong Kim. It has been fine-tuned using a novel Safety-First WaRP (Weight space Rotation Process), a three-phase pipeline designed to enhance safety alignment without significantly compromising utility.

Key Capabilities

  • Enhanced Safety Alignment: Utilizes a unique WaRP method to protect safety mechanisms, ensuring the model maintains refusal capabilities for harmful requests.
  • Improved Reasoning Utility: Fine-tuned on the GSM8K dataset, demonstrating improved performance on mathematical reasoning tasks.
  • Balanced Safety-Utility Tradeoff: Achieves a balance between safety and performance by employing gradient masking during incremental learning to preserve important safety-related directions.
  • Gradient-based Neuron Importance: Identifies and protects critical neurons (e.g., 419 neurons in layer 31) related to safety during the fine-tuning process.

Good For

  • Applications requiring a safety-aligned LLM that can effectively handle harmful prompts.
  • Use cases where mathematical reasoning and problem-solving are important, alongside safety.
  • Developers looking for a model that has undergone a structured process to balance ethical considerations with performance.

This model was trained using safety data from LibrAI/do-not-answer and utility data from openai/gsm8k, building upon the meta-llama/Llama-3.1-8B-Instruct base model.