kmseong/llama3_2_3b-instruct-SSFT-lr5e-5
The kmseong/llama3_2_3b-instruct-SSFT-lr5e-5 model is a 3-billion-parameter instruction-tuned causal language model based on the Llama 3.2 architecture. It was fine-tuned with a safety-first Weight-space Rotation Process (WaRP) to strengthen safety alignment, and is designed to maintain refusal behavior on harmful requests while improving utility on reasoning tasks, offering a balanced safety-utility tradeoff.
Model Overview
The kmseong/llama3_2_3b-instruct-SSFT-lr5e-5 model is a 3-billion-parameter instruction-tuned variant of Llama 3.2, published by kmseong. Its distinguishing feature is the safety-first WaRP (Weight-space Rotation Process), a three-phase training pipeline designed to achieve robust safety alignment.
Key Capabilities & Training
The model was fine-tuned in three phases:
- Basis Construction: Activations from the FFN layers were collected on safety data, and SVD was applied to identify 419 important neurons in layer 31.
- Importance Scoring: Gradient-based importance scores were computed with teacher forcing on safety responses, producing masks over the critical directions (the first sketch after this list illustrates these two phases).
- Incremental Learning: The model was then fine-tuned on utility tasks (such as GSM8K) with gradient masking, which protects the identified important directions so that safety mechanisms are preserved while general utility improves (second sketch below).
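
The card does not ship the WaRP training code, so the first two phases are easiest to see in a hedged sketch. Assumptions not taken from the source: activations are gathered as a (tokens × hidden) matrix, importance is the squared gradient of a teacher-forced safety loss projected onto each SVD direction, and every helper name (`build_safety_basis`, `score_directions`, `make_mask`) is hypothetical.

```python
# Minimal sketch of WaRP phases 1-2 (basis construction and importance
# scoring). Hypothetical helper names; not the author's released code.
import torch

@torch.no_grad()
def build_safety_basis(ffn_acts: torch.Tensor) -> torch.Tensor:
    """ffn_acts: (num_tokens, hidden_dim) activations collected on
    safety data. Returns an orthonormal direction basis via SVD."""
    # Rows of Vh are right singular vectors: directions in activation
    # space, ordered by how much safety-data variance they explain.
    _, _, vh = torch.linalg.svd(ffn_acts, full_matrices=False)
    return vh

def score_directions(model, basis, safety_batches, ffn_weight):
    """Gradient-based importance: accumulate the squared gradient of a
    teacher-forced safety loss along each basis direction."""
    scores = torch.zeros(basis.shape[0])
    for batch in safety_batches:
        model.zero_grad()
        # Teacher forcing: the reference safety response is the label.
        loss = model(input_ids=batch["input_ids"],
                     attention_mask=batch["attention_mask"],
                     labels=batch["labels"]).loss
        loss.backward()
        g = ffn_weight.grad              # (out_dim, hidden_dim)
        scores += (g @ basis.T).pow(2).sum(dim=0)
    return scores

def make_mask(scores: torch.Tensor, k: int = 419) -> torch.Tensor:
    """Boolean mask over directions; the card reports 419 protected
    neurons in layer 31."""
    mask = torch.zeros_like(scores, dtype=torch.bool)
    mask[scores.topk(k).indices] = True
    return mask
```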
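
Phase 3 can then be expressed as an ordinary fine-tuning loop whose gradients are edited before each optimizer step: components lying in the protected subspace are projected out, so updates cannot move the weights along safety-critical directions. This is again a sketch under the same assumptions (orthonormal basis rows, a single masked FFN weight); the actual pipeline may mask more tensors.

```python
# Minimal sketch of WaRP phase 3: gradient-masked fine-tuning on utility
# data such as GSM8K. Hypothetical structure, not the released code.
import torch

def project_out_protected(grad, basis, mask):
    """Remove gradient components along protected directions so the
    optimizer cannot move the weights in the safety-critical subspace."""
    protected = basis[mask]           # (k, hidden_dim), orthonormal rows
    coeffs = grad @ protected.T       # (out_dim, k) projection coefficients
    return grad - coeffs @ protected  # keep only the orthogonal part

def masked_step(model, batch, optimizer, ffn_weight, basis, mask):
    optimizer.zero_grad()
    loss = model(input_ids=batch["input_ids"],
                 attention_mask=batch["attention_mask"],
                 labels=batch["labels"]).loss
    loss.backward()
    # Edit the gradient in place before the optimizer consumes it.
    ffn_weight.grad = project_out_protected(ffn_weight.grad, basis, mask)
    optimizer.step()
    return loss.item()
```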
Safety Features & Use Cases
The WaRP methodology ensures that the model:
- Protects safety mechanisms through gradient masking.
- Maintains refusal capability for harmful requests.
- Improves utility on reasoning tasks, as demonstrated by its training on the openai/gsm8k dataset.
- Achieves a balanced safety-utility tradeoff, making it suitable for applications where both performance and responsible AI behavior are critical. It leverages safety data from LibrAI/do-not-answer to enhance its refusal capabilities.
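
The card itself does not include inference code; since the checkpoint is a Llama 3.2 instruct variant, the standard transformers chat-template pattern should apply. The generation settings below are illustrative, and `device_map="auto"` assumes accelerate is installed.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "kmseong/llama3_2_3b-instruct-SSFT-lr5e-5"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)

messages = [{"role": "user",
             "content": "A train travels 60 km in 1.5 hours. What is its average speed?"}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```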