Name: kmseong/llama-2-13b_WaRP-cb_alpha5_layers10-20_lr1e-4-lr5e-5 API
Brand: Featherless.ai
Price: 10.00 USD
Availability: InStock
Author: kmseong

WaRP-Safety-Llama3_8B_Instruct: Safety-Aligned LLM

This model, developed by kmseong, is a fine-tuned version of the Llama 3.1 8B Instruct base model, aligned for safety using the Weight space Rotation Process (WaRP). WaRP is a three-phase pipeline intended to balance safety against utility.

Key Capabilities & Training:

Safety-First WaRP Pipeline: Utilizes a three-phase process involving Basis Construction, Importance Scoring, and Incremental Learning.
Gradient Masking: Protects critical safety mechanisms by identifying and preserving important neuronal directions during fine-tuning.
Refusal Capability: Designed to maintain strong refusal capabilities for harmful or unsafe requests.
Improved Utility: Alongside the safety objective, the model is fine-tuned on the GSM8K dataset to improve utility on reasoning tasks.
Balanced Performance: Aims to provide a robust balance between safety and general task performance.
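
The gradient-masking idea in the list above can be illustrated with a small sketch. This is not the author's WaRP implementation: the basis, the importance scores, the threshold, and the helper name `mask_gradient` are all hypothetical, chosen only to show how an update can be restricted to directions that were not marked as important. How WaRP actually constructs its basis and computes importance is not described in this listing.

```python
# Illustrative sketch only; names and numbers are assumptions, not WaRP's actual code.
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def mask_gradient(grad, basis, importance, threshold):
    """Drop the gradient components that lie along 'important' basis directions.

    grad:       gradient vector (list of floats)
    basis:      list of orthonormal direction vectors (stand-in for phase 1)
    importance: one score per basis direction (stand-in for phase 2)
    threshold:  directions scoring at or above this value are frozen
    """
    masked = [0.0] * len(grad)
    for direction, score in zip(basis, importance):
        if score >= threshold:
            continue  # frozen direction: this component of the update is discarded
        coeff = dot(grad, direction)  # coordinate of grad in the rotated basis
        masked = [m + coeff * d for m, d in zip(masked, direction)]
    return masked


# Orthonormal basis: the standard 2D axes rotated by 45 degrees.
s = math.sqrt(0.5)
basis = [[s, s], [-s, s]]
importance = [0.9, 0.1]  # made-up scores: the first direction is treated as safety-critical
grad = [1.0, 0.0]

# Stand-in for phase 3: only the unfrozen direction contributes to the update.
masked = mask_gradient(grad, basis, importance, threshold=0.5)
print(masked)
```

In this toy setting the masked gradient has no component along the first basis direction, so a fine-tuning step taken with it leaves the weights unchanged along that direction.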

Use Cases:

Applications requiring enhanced safety and ethical AI responses.
Scenarios where maintaining refusal for harmful content is critical.
Tasks benefiting from a reasoning-capable model with strong safety guardrails.

This model was trained on safety data from LibrAI/do-not-answer and utility data from openai/gsm8k.
