kmseong/llama2_7b_chat-SSFT-AGNEWS-FT-safety-mix-0.1-lr3e-5
kmseong/llama2_7b_chat-SSFT-AGNEWS-FT-safety-mix-0.1-lr3e-5 is a 7-billion-parameter chat model fine-tuned by kmseong from Llama 2 7B Chat for safety alignment using the Weight space Rotation Process (WaRP). The model is designed to maintain refusal capability on harmful requests while improving utility on reasoning tasks. It uses a three-phase training pipeline that protects safety mechanisms while enhancing general utility.
Overview
This model is a fine-tuned version of the Meta Llama 2 7B Chat base model, specifically engineered for enhanced safety alignment. It uses the Weight space Rotation Process (WaRP), a three-phase pipeline designed to balance safety and utility.
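The model should load with the standard Hugging Face transformers API. The snippet below is a minimal usage sketch, not an official example from this card; the chat template, dtype, device placement, and generation settings are assumptions you may need to adapt to your setup.

```python
# Minimal usage sketch, assuming the standard Hugging Face transformers API.
# The dtype, device placement, and generation settings are illustrative
# assumptions, not settings taken from this card.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "kmseong/llama2_7b_chat-SSFT-AGNEWS-FT-safety-mix-0.1-lr3e-5"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # assumes a GPU with enough memory for fp16
    device_map="auto",
)

# A benign reasoning-style prompt; the model should answer rather than refuse.
messages = [
    {
        "role": "user",
        "content": "If a robe takes 2 bolts of blue fiber and half that much "
                   "white fiber, how many bolts does it take in total?",
    }
]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output_ids = model.generate(input_ids, max_new_tokens=256, do_sample=False)
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```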
Key Capabilities
- Safety Alignment: Employs a "Safety-First WaRP" methodology to protect safety mechanisms and maintain refusal capabilities for harmful requests.
- Utility Improvement: While prioritizing safety, the model also shows improved utility on reasoning tasks; the utility fine-tuning phase uses the GSM8K dataset.
- Gradient Masking: Preserves safety through gradient masking during incremental learning, protecting the important directions identified during basis construction.
- Balanced Trade-off: Aims to provide a robust balance between safety and general task performance.
Training Methodology
The WaRP process involves three distinct phases (a code sketch follows this list):
- Basis Construction: Identifies important neurons in FFN layers using safety data and SVD.
- Importance Scoring: Calculates gradient-based importance scores to generate masks for critical safety directions.
- Incremental Learning: Fine-tunes on utility tasks (like GSM8K) while applying gradient masking to preserve the identified safety mechanisms.
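The sketch below illustrates the general shape of these three phases on a single FFN weight matrix. It is an illustration under stated assumptions, not the authors' implementation: the function names, the activation-covariance SVD, the gradient-energy importance score, and the keep_ratio threshold are all hypothetical.

```python
# Illustrative sketch of the three WaRP phases for one FFN weight matrix.
# NOT the authors' code: all names, scores, and thresholds are hypothetical.
import torch

def build_basis(activations: torch.Tensor) -> torch.Tensor:
    """Phase 1 (Basis Construction): SVD over FFN input activations
    collected while running the model on safety data.

    activations: (num_tokens, hidden_dim). Returns an orthonormal basis
    U of shape (hidden_dim, hidden_dim) whose leading columns span the
    directions the layer actually uses on safety prompts.
    """
    U, _, _ = torch.linalg.svd(activations.T @ activations)
    return U

def importance_mask(weight_grad: torch.Tensor, U: torch.Tensor,
                    keep_ratio: float = 0.1) -> torch.Tensor:
    """Phase 2 (Importance Scoring): score each basis direction by the
    gradient energy it carries on safety data, then mark the top
    directions as protected (mask value 0 blocks updates there)."""
    rotated = weight_grad @ U                 # gradient expressed in the basis
    scores = rotated.pow(2).sum(dim=0)        # per-direction importance
    k = int(keep_ratio * scores.numel())
    mask = torch.ones_like(scores)
    mask[scores.topk(k).indices] = 0.0        # protect critical safety directions
    return mask

def masked_grad(weight_grad: torch.Tensor, U: torch.Tensor,
                mask: torch.Tensor) -> torch.Tensor:
    """Phase 3 (Incremental Learning): during utility fine-tuning
    (e.g. on GSM8K), zero the gradient components that fall along
    protected safety directions, then rotate back to weight space."""
    return ((weight_grad @ U) * mask) @ U.T

# Applied inside the utility fine-tuning loop, just before optimizer.step():
#   for p, (U, mask) in zip(ffn_weights, bases_and_masks):
#       p.grad = masked_grad(p.grad, U, mask)
```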
Should I use this for my use case?
This model is suited to applications where safety and responsible AI behavior are paramount but general reasoning capability is also required. If you need strong refusal behavior on harmful content while still performing well on tasks like mathematical reasoning, this model offers a specialized trade-off. You should still evaluate outputs and implement additional safety measures as needed; a minimal example of such a guard follows.
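As one example of an additional measure, a deployment can run a second-pass check on generated text before returning it. The sketch below is purely hypothetical and not part of this model; guarded_reply, generate_reply, and the marker list are placeholders for whatever moderation your application uses.

```python
# Hypothetical post-generation guard: a second-pass check before returning
# model output. The marker list and refusal message are placeholders; a real
# deployment would typically use a dedicated moderation model or API instead.
from typing import Callable

BLOCKED_MARKERS = [
    "how to build a weapon",
    "step-by-step instructions for harming",
]

def guarded_reply(prompt: str, generate_reply: Callable[[str], str]) -> str:
    reply = generate_reply(prompt)
    if any(marker in reply.lower() for marker in BLOCKED_MARKERS):
        return "I can't help with that."
    return reply
```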