kmseong/llama3_2_3b-instruct-SSFT-lr5e-5

TEXT GENERATION · Concurrency Cost: 1 · Model Size: 3.2B · Quant: BF16 · Ctx Length: 32k · Published: Apr 28, 2026 · License: llama3.1 · Architecture: Transformer

The kmseong/llama3_2_3b-instruct-SSFT-lr5e-5 model is a 3.2-billion-parameter instruction-tuned causal language model based on the Llama 3.1 architecture. It was fine-tuned with a Safety-First Weight-space Rotation Process (WaRP) to strengthen safety alignment: the model is designed to retain its refusal behavior on harmful requests while improving utility on reasoning tasks, giving a balanced safety-utility tradeoff.


Model Overview

The kmseong/llama3_2_3b-instruct-SSFT-lr5e-5 model is a 3.2-billion-parameter instruction-tuned variant of the Llama 3.1 architecture, developed by kmseong. Its distinguishing feature is the Safety-First WaRP (Weight-space Rotation Process), a three-phase training pipeline designed to achieve robust safety alignment.

Key Capabilities & Training

This model was fine-tuned in three phases:

  • Basis Construction: Activations from FFN layers were collected using safety data, and SVD was applied to identify 419 important neurons in layer 31.
  • Importance Scoring: Gradient-based methods were used to calculate importance scores and generate masks for critical directions, with teacher forcing on safety responses.
  • Incremental Learning: The model was further fine-tuned on utility tasks (like GSM8K) using gradient masking, which protected the identified important directions to preserve safety mechanisms while improving general utility.
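The three phases above can be sketched on a toy layer. This is a minimal, hedged illustration only: the dimensions, data, and top-k threshold are invented for the example (the actual model identifies 419 important directions in FFN layer 31, and uses gradient-based importance scores rather than the singular-value proxy used here).

```python
import torch

torch.manual_seed(0)
d_model, n_samples, k = 16, 64, 4  # toy sizes, not the model's real dimensions

ffn = torch.nn.Linear(d_model, d_model, bias=False)  # stand-in for an FFN sublayer

# Phase 1 -- basis construction: collect activations on (stand-in) safety
# data and apply SVD to find the dominant directions of the activation space.
safety_inputs = torch.randn(n_samples, d_model)
with torch.no_grad():
    acts = ffn(safety_inputs)
U, S, Vh = torch.linalg.svd(acts, full_matrices=False)
important_dirs = Vh[:k]                      # top-k directions (orthonormal rows)

# Phase 2 -- importance scoring: the real method scores directions via
# gradients with teacher forcing on safety responses; here the singular
# values serve as a crude proxy, and the top-k rows become the mask basis.
mask_basis = important_dirs                  # shape (k, d_model)

# Phase 3 -- incremental learning with gradient masking: project each
# weight-gradient row off the protected subspace before the optimizer
# step, so the weights' action on safety-relevant directions is frozen.
def mask_gradient(grad, basis):
    proj = grad @ basis.T @ basis            # component inside protected subspace
    return grad - proj                       # keep only the orthogonal remainder

opt = torch.optim.SGD(ffn.parameters(), lr=1e-2)
utility_x = torch.randn(n_samples, d_model)
utility_y = torch.randn(n_samples, d_model)
w_before = ffn.weight.detach().clone()

for _ in range(5):                           # stand-in utility fine-tuning loop
    opt.zero_grad()
    loss = torch.nn.functional.mse_loss(ffn(utility_x), utility_y)
    loss.backward()
    ffn.weight.grad = mask_gradient(ffn.weight.grad, mask_basis)
    opt.step()
```

After training, the layer's output on any input lying in the protected subspace is unchanged, which is the sense in which the masking "protects" the identified safety directions while the rest of the weight space adapts to the utility task.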

Safety Features & Use Cases

The WaRP methodology ensures that the model:

  • Protects safety mechanisms through gradient masking.
  • Maintains refusal capability for harmful requests.
  • Improves utility on reasoning tasks, as demonstrated by its training on the openai/gsm8k dataset.
  • Achieves a balanced safety-utility tradeoff, making it suitable for applications where both performance and responsible AI behavior matter. Its refusal behavior is reinforced with safety data from LibrAI/do-not-answer.
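The model can be loaded and prompted with the standard Hugging Face transformers chat API; the sketch below assumes the checkpoint ships a chat template (the prompt text is illustrative).

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "kmseong/llama3_2_3b-instruct-SSFT-lr5e-5"
tokenizer = AutoTokenizer.from_pretrained(model_id)
# BF16 matches the published quantization; device_map="auto" needs accelerate.
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# A GSM8K-style reasoning prompt; a harmful request should instead be refused.
messages = [{"role": "user", "content": "If a bakery sells 12 muffins per tray and bakes 7 trays, how many muffins is that?"}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output_ids = model.generate(input_ids, max_new_tokens=256)
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```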