kmseong/WaRP-Safety-Llama3_8B_Instruct
The kmseong/WaRP-Safety-Llama3_8B_Instruct is an 8 billion parameter Llama 3.1 Instruct model developed by Min-Seong Kim, fine-tuned for safety alignment using a novel Weight space Rotation Process (WaRP). This model focuses on maintaining refusal capabilities for harmful requests while improving utility on reasoning tasks like GSM8K. It achieves a balanced safety-utility tradeoff through a 3-phase training pipeline that protects safety mechanisms via gradient masking.
Overview
This model, kmseong/WaRP-Safety-Llama3_8B_Instruct, is an 8 billion parameter Llama 3.1 Instruct variant developed by Min-Seong Kim. It has been fine-tuned specifically for safety alignment using a unique Weight space Rotation Process (WaRP). The training methodology is a 3-phase pipeline designed to enhance safety without significantly compromising utility.
Key Capabilities
- Enhanced Safety Alignment: Utilizes a novel WaRP method to protect safety mechanisms.
- Refusal Capability: Maintains strong refusal capabilities for harmful or inappropriate requests.
- Improved Utility: Demonstrates improved performance on utility tasks, specifically GSM8K, through incremental learning with gradient masking.
- Balanced Tradeoff: Achieves a balance between safety and general utility, preventing safety enhancements from degrading other performance aspects.
Training Details
The WaRP training pipeline has three phases: Basis Construction (identifying neurons important to safety using safety data), Importance Scoring (calculating gradient-based importance scores and generating protection masks), and Incremental Learning (fine-tuning on utility tasks such as openai/gsm8k while masking gradients along the protected safety directions). Safety data was drawn from LibrAI/do-not-answer.
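The masking idea behind the last two phases can be sketched in a few lines of NumPy. This is an illustrative toy, not the WaRP implementation: the `|g * w|` scoring rule, the `keep_fraction` hyperparameter, and all function names are assumptions for the sketch, and the weight-space rotation / basis-construction step is omitted entirely.

```python
import numpy as np

def importance_scores(weights, safety_grads):
    """Importance Scoring (sketch): score each weight's relevance to
    safety as |g * w| -- one common gradient-based choice; the exact
    WaRP scoring rule is not specified in this card."""
    return np.abs(safety_grads * weights)

def protection_mask(scores, keep_fraction=0.1):
    """Mark the top `keep_fraction` most safety-critical weights
    as protected (mask value 1)."""
    k = max(1, int(keep_fraction * scores.size))
    threshold = np.sort(scores.ravel())[-k]
    return (scores >= threshold).astype(scores.dtype)

def masked_update(weights, utility_grads, mask, lr=1e-4):
    """Incremental Learning (sketch): one gradient step on a utility
    task, with gradients zeroed on protected safety directions."""
    return weights - lr * utility_grads * (1.0 - mask)

# Toy demonstration with random tensors in place of model weights.
rng = np.random.default_rng(0)
w = rng.normal(size=(4, 4))
g_safety = rng.normal(size=(4, 4))   # gradient from safety data
g_utility = rng.normal(size=(4, 4))  # gradient from a utility task

mask = protection_mask(importance_scores(w, g_safety), keep_fraction=0.25)
w_new = masked_update(w, g_utility, mask)

# Protected weights are unchanged; the rest follow the utility gradient.
assert np.allclose(w_new[mask == 1], w[mask == 1])
```

The key property is visible in the final assertion: weights the safety data marks as important never move during utility fine-tuning, which is how the pipeline can improve GSM8K performance without eroding refusal behavior.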