kmseong/llama3_2_3b-instruct-WaRP_lr3e-5

TEXT GENERATION · Concurrency cost: 1 · Model size: 3.2B · Quant: BF16 · Context length: 32k · Published: Apr 28, 2026 · License: llama3.1 · Architecture: Transformer

kmseong/llama3_2_3b-instruct-WaRP_lr3e-5 is a 3.2-billion-parameter instruction-tuned language model, fine-tuned from Llama 3.2 3B Instruct using the Safety-First Weight space Rotation Process (WaRP). WaRP targets safety alignment by protecting important directions in weight space while improving utility: the model is designed to retain its refusal behavior on harmful requests while gaining performance on reasoning tasks, offering a balanced safety-utility tradeoff.


Model Overview

This model, kmseong/llama3_2_3b-instruct-WaRP_lr3e-5, is a safety-aligned, instruction-tuned variant of the meta-llama/Llama-3.2-3B-Instruct base model. It leverages the Safety-First Weight space Rotation Process (WaRP), a three-phase training pipeline designed to enhance safety without significantly compromising utility.
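The model can be loaded like any other chat-tuned checkpoint with Hugging Face `transformers`. The sketch below is a generic usage example, not an official snippet from this model card; the prompt, generation settings, and the `generate` helper name are illustrative:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "kmseong/llama3_2_3b-instruct-WaRP_lr3e-5"


def generate(prompt: str, max_new_tokens: int = 256) -> str:
    """Load the model, apply its chat template, and return the reply text."""
    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID,
        torch_dtype="bfloat16",  # the card lists BF16 quantization
        device_map="auto",
    )
    messages = [{"role": "user", "content": prompt}]
    input_ids = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)
    output = model.generate(input_ids, max_new_tokens=max_new_tokens)
    # Strip the prompt tokens and decode only the newly generated reply.
    return tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True)


if __name__ == "__main__":
    print(generate("A train travels 60 km in 45 minutes. What is its speed in km/h?"))
```

Because the checkpoint is safety-aligned, harmful prompts should produce refusals rather than completions; benign reasoning prompts like the one above should be answered normally.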

Key Capabilities & Features

  • Enhanced Safety Alignment: Uses the WaRP method to construct an orthonormal basis from safety data, identify the neurons important for refusal behavior, and apply gradient masking during fine-tuning so those directions are preserved.
  • Balanced Safety-Utility Tradeoff: Specifically engineered to protect safety mechanisms, ensuring robust refusal capabilities for harmful requests, while simultaneously improving performance on utility tasks like reasoning (e.g., GSM8K).
  • Targeted Fine-tuning: The training procedure involves collecting activations from FFN layers using safety data (LibrAI/do-not-answer), computing SVD for basis vectors, and then incrementally learning on utility data (openai/gsm8k) with gradient masking to preserve safety.
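The basis-construction and gradient-masking steps above can be sketched in a toy NumPy example. All arrays here are random stand-ins, and the variable names are hypothetical; this illustrates the projection-based masking idea, not the authors' exact WaRP implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins (assumptions): activations collected from one FFN layer on
# safety prompts, and a gradient of that layer's weights from a utility step.
hidden = 16
safety_acts = rng.normal(size=(200, hidden))      # [num_tokens, hidden]
utility_grad = rng.normal(size=(hidden, hidden))  # [hidden, hidden]

# 1) SVD of the safety activations: the rows of vt form an orthonormal basis
#    ordered by how much safety-data variance each direction captures.
_, s, vt = np.linalg.svd(safety_acts, full_matrices=False)

# 2) Keep the top-k directions as the protected "safety subspace".
k = 4
protected = vt[:k]  # [k, hidden], orthonormal rows

# 3) Gradient masking: subtract the gradient's projection onto the protected
#    subspace, so utility fine-tuning cannot move weights along it.
masked_grad = utility_grad - utility_grad @ protected.T @ protected

# The masked gradient is (numerically) orthogonal to every protected direction.
print(np.abs(masked_grad @ protected.T).max())
```

The design intuition: updates orthogonal to the safety basis can still improve utility-task loss, while the directions that encode refusal behavior are frozen, which is what yields the safety-utility tradeoff the card describes.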

When to Use This Model

This model is particularly well-suited for applications where:

  • Safety is paramount: Its core design prioritizes maintaining refusal capabilities against harmful prompts.
  • Reasoning tasks are important: It demonstrates improved utility on tasks requiring logical inference.
  • A balance between safety and performance is desired: The WaRP method aims to optimize both aspects concurrently.

Users should still perform their own evaluations and implement additional safety measures as needed for specific use cases.