kmseong/llama3.1-8b-base-gsm8k-safeinstr-ratio0.1-lr1e-5

Text Generation | Concurrency Cost: 1 | Model Size: 8B | Quant: FP8 | Ctx Length: 32k | Published: May 6, 2026 | License: llama3.1 | Architecture: Transformer | Warm

kmseong/llama3.1-8b-base-gsm8k-safeinstr-ratio0.1-lr1e-5 is an 8-billion-parameter Llama 3.1 Instruct model fine-tuned by Min-Seong Kim with a Safety-First Weight space Rotation Process (WaRP). It is designed to strengthen safety alignment while preserving utility on reasoning tasks, and is reported to improve performance on benchmarks such as GSM8K. The safety-utility balance comes from gradient masking during incremental learning, which keeps utility fine-tuning from overwriting the model's safety mechanisms. The model targets applications that need robust safety behavior alongside strong mathematical and logical reasoning.


WaRP-Safety-Llama3_8B_Instruct: Safety-Aligned Reasoning Model

This model, developed by Min-Seong Kim, is an 8-billion-parameter Llama 3.1 Instruct variant fine-tuned for stronger safety alignment with a novel Weight space Rotation Process (WaRP). WaRP uses a three-phase pipeline to balance safety and utility, which sets it apart from standard instruction-tuned models.

Key Capabilities & Features

  • Safety-First Alignment: Uses the WaRP methodology to protect safety mechanisms through gradient masking, with the goal of retaining refusal behavior on harmful requests.
  • Improved Reasoning Utility: Fine-tuned on the GSM8K dataset to improve performance on mathematical and logical reasoning tasks.
  • Balanced Safety-Utility Tradeoff: Designed to keep utility high while preserving safety behavior, a common difficulty when fine-tuning aligned LLMs.
  • Gradient Masking: Masks gradients during incremental learning so that updates do not alter the neuron directions identified as critical for safety.
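
One way to read "protecting directions with gradient masking" is as a projection: before each update, the gradient component along the protected directions is removed, so the weights cannot move along them. The sketch below illustrates that idea on a single toy weight matrix with NumPy. The function names (`project_out`, `masked_sgd_step`), the matrix sizes, and the use of plain SGD are assumptions made for illustration and are not taken from the model's training code.

```python
import numpy as np

def project_out(grad, protected_dirs):
    """Strip from `grad` (d_out x d_in) every component along the protected
    input directions. `protected_dirs` is (d_in x k) with orthonormal columns."""
    return grad - (grad @ protected_dirs) @ protected_dirs.T

def masked_sgd_step(weight, grad, protected_dirs, lr=1e-5):
    """Plain SGD step whose update has no component along the protected directions."""
    return weight - lr * project_out(grad, protected_dirs)

# Toy usage with random data (hypothetical sizes, not the model's real shapes).
rng = np.random.default_rng(0)
weight = rng.normal(size=(4, 6))
grad = rng.normal(size=(4, 6))
# QR of a random matrix gives orthonormal columns to stand in for safety directions.
protected_dirs, _ = np.linalg.qr(rng.normal(size=(6, 2)))
new_weight = masked_sgd_step(weight, grad, protected_dirs, lr=0.1)
```

Under these assumptions, any input that lies in the span of the protected directions produces the same layer output before and after the step, while the remaining directions stay free to adapt to the utility task.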

Training & Methodology

The training involved a three-phase process:

  1. Basis Construction: Identifying important neurons in the feed-forward network (FFN) layers using safety data and singular value decomposition (SVD).
  2. Importance Scoring: Calculating gradient-based importance scores and generating masks.
  3. Incremental Learning: Fine-tuning on utility tasks (GSM8K) with gradient masking to protect safety-critical directions.
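
The three phases above can be sketched end to end on a single toy weight matrix. This is a minimal NumPy illustration, not the author's implementation: it assumes the basis comes from an SVD of safety-prompt activations, that importance is the absolute gradient of each coefficient in that rotated basis, and that the top fraction of scores is frozen. The function names and the `keep_ratio` parameter are hypothetical.

```python
import numpy as np

def build_basis(safety_activations):
    """Phase 1: orthonormal basis of input directions, taken from the SVD of
    safety activations with shape (n_samples x d_in)."""
    _, _, vt = np.linalg.svd(safety_activations, full_matrices=True)
    return vt.T  # columns are basis directions, shape (d_in x d_in)

def importance_mask(basis, safety_grad, keep_ratio):
    """Phase 2: score each rotated coefficient by |gradient| on safety data
    and mark the top `keep_ratio` fraction as protected (True)."""
    scores = np.abs(safety_grad @ basis)  # gradient w.r.t. rotated coefficients
    k = max(1, int(round(keep_ratio * scores.size)))
    threshold = np.sort(scores, axis=None)[-k]
    return scores >= threshold

def masked_step(weight, basis, utility_grad, mask, lr):
    """Phase 3: SGD step in rotated coordinates with protected coefficients frozen."""
    coeffs = weight @ basis                     # rotate weights into the basis
    grad = (utility_grad @ basis) * (~mask)     # zero gradients on protected entries
    return (coeffs - lr * grad) @ basis.T       # rotate back to the original space

# Toy usage with random stand-ins for activations and gradients.
rng = np.random.default_rng(0)
safety_acts = rng.normal(size=(32, 6))
weight = rng.normal(size=(4, 6))
safety_grad = rng.normal(size=(4, 6))
utility_grad = rng.normal(size=(4, 6))

basis = build_basis(safety_acts)
mask = importance_mask(basis, safety_grad, keep_ratio=0.25)
new_weight = masked_step(weight, basis, utility_grad, mask, lr=0.1)
```

Working in the rotated basis is what lets a simple elementwise mask protect whole directions rather than individual weights: each masked coefficient corresponds to one output neuron's weight along one basis direction.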

Ideal Use Cases

This model is particularly well-suited for applications where:

  • Robust safety alignment is paramount, preventing the generation of harmful content.
  • Strong performance in reasoning tasks, especially mathematical problem-solving, is required.
  • A balanced approach to safety and utility is preferred over models that might sacrifice one for the other.