Name: kmseong/llama2_7b_chat-arc-c-WaRP-lr5e-5 API
Brand: Featherless.ai
Price: 10.00 USD
Availability: InStock
Author: kmseong

Overview

This model, kmseong/WaRP-Safety-Llama3_8B_Instruct, is a fine-tuned version of meta-llama/Llama-3.1-8B-Instruct developed by kmseong. Its primary innovation lies in its safety alignment achieved through a 3-phase Weight space Rotation Process (WaRP) pipeline. The training procedure involved constructing an orthonormal basis from FFN layer activations using safety data, scoring neuron importance with gradient-based methods, and then incrementally learning utility tasks (like GSM8K) while protecting these identified safety-critical directions via gradient masking.

Key Capabilities

Enhanced Safety Alignment: Utilizes a novel WaRP method to embed safety mechanisms directly into the model's weight space.
Harmful Request Refusal: Designed to maintain strong refusal capabilities for inappropriate or harmful queries.
Balanced Utility: Improves performance on reasoning tasks (e.g., GSM8K) while preserving safety, aiming for an optimal safety-utility tradeoff.
Gradient Masking: Employs gradient masking during fine-tuning to protect important safety-related directions.

Good For

Applications requiring a robustly safety-aligned LLM.
Use cases where maintaining refusal for harmful content is critical.
Scenarios needing a balance between general utility and strong safety features.

Overview

Overview

Key Capabilities

Good For

Full Model Card (README)