kmseong/llama2_7b_chat-SSFT-AGNEWS-FT-safeInstr-0.1-lr5e-5

Text generation · Concurrency cost: 1 · Model size: 7B · Quant: FP8 · Ctx length: 4K · Published: Apr 30, 2026 · License: llama3.1 · Architecture: Transformer

The kmseong/llama2_7b_chat-SSFT-AGNEWS-FT-safeInstr-0.1-lr5e-5 model is a fine-tune of Llama 3.1 8B Instruct by kmseong, aligned for safety using the Weight space Rotation Process (WaRP). WaRP is a 3-phase training pipeline that protects the model's safety mechanisms, preserving its refusal behaviour on harmful requests while improving utility on reasoning tasks. The model is designed to balance safety and performance, making it suitable for applications that require robust safety alignment.


WaRP-Safety-Llama3_8B_Instruct: Safety-Aligned Llama 3.1 8B

This model, developed by kmseong, is a fine-tuned version of meta-llama/Llama-3.1-8B-Instruct specifically engineered for enhanced safety alignment. It utilizes a novel Weight space Rotation Process (WaRP), a 3-phase pipeline designed to integrate safety without significantly compromising utility.

Key Capabilities & Training

  • Safety-First WaRP: Employs a unique three-phase training approach:
    • Basis Construction: Identifies important neurons related to safety using SVD on activations from FFN layers.
    • Importance Scoring: Calculates gradient-based importance scores to generate masks for critical safety directions.
    • Incremental Learning: Fine-tunes on utility tasks (like GSM8K) while protecting these important safety directions through gradient masking.
  • Balanced Safety-Utility: Aims to improve performance on reasoning tasks while preserving robust refusal capabilities for harmful requests.
  • Protected Safety Mechanisms: Ensures that the model maintains its ability to identify and refuse unsafe prompts.
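The three phases above can be sketched in miniature. This is an illustrative NumPy toy, not the released training code: the weight matrix, activations, and top-k scoring rule are stand-ins, and real WaRP operates on FFN layers of the full model. The core idea it demonstrates is the same, though: build a basis with SVD, mark some directions as safety-critical, and zero the gradient components along those directions before each utility-task update.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for one FFN weight matrix and a batch of its input activations.
d_in, d_out, n_samples = 16, 8, 64
W = rng.normal(size=(d_out, d_in))
acts = rng.normal(size=(n_samples, d_in))  # activations collected on safety data

# Phase 1 -- Basis construction: SVD of the activation matrix yields an
# orthonormal basis (rows of Vt) ordered by how strongly each direction
# is expressed in the safety activations.
_, S, Vt = np.linalg.svd(acts, full_matrices=False)

# Phase 2 -- Importance scoring: as a simple proxy, treat the top-k singular
# directions as "safety-critical" and build a mask over them. (The actual
# method uses gradient-based importance scores.)
k = 4
mask = np.zeros(d_in)
mask[:k] = 1.0  # 1 = protected safety direction, 0 = free to update

# Phase 3 -- Incremental learning with gradient masking: express the utility
# gradient in the SVD basis, zero its components along protected directions,
# then map it back before applying the update.
grad = rng.normal(size=(d_out, d_in))   # gradient from a utility-task step
grad_in_basis = grad @ Vt.T             # columns now indexed by basis direction
grad_in_basis *= (1.0 - mask)           # block updates along safety directions
masked_grad = grad_in_basis @ Vt        # back to original coordinates

lr = 1e-2
W_new = W - lr * masked_grad

# The update leaves the protected subspace untouched:
print(np.abs((W_new - W) @ Vt[:k].T).max())  # ~0: no change along safety dirs
```

Because `Vt` is orthonormal, projecting the applied update onto the first `k` basis directions recovers exactly the components that were zeroed, so safety-critical directions are provably unchanged while the remaining directions still learn from the utility gradient.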

Datasets Used

  • Safety Data: LibrAI/do-not-answer
  • Utility Data: openai/gsm8k

Use Cases

This model is particularly well-suited for applications where strong safety alignment is paramount, such as chatbots, content moderation, or any interactive AI system that needs to handle user inputs responsibly while still performing general reasoning tasks effectively. Users should still evaluate outputs and implement additional safety measures as needed.
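As one example of the "additional safety measures" mentioned above, a deployment can screen user input before it ever reaches the model, as a layer on top of the model's own refusal behaviour. The wrapper below is a hypothetical sketch: the blocklist, refusal message, and `generate` callback are placeholders, not part of this model's release.

```python
# Illustrative pre-filter sketch (placeholder blocklist and refusal message).
BLOCKED_TOPICS = ("build a bomb", "synthesize meth")

def guarded_generate(prompt: str, generate) -> str:
    """Call `generate(prompt)` only if the prompt passes a simple screen."""
    lowered = prompt.lower()
    if any(topic in lowered for topic in BLOCKED_TOPICS):
        return "I can't help with that request."
    return generate(prompt)

# Usage with a stand-in generator; in practice `generate` would wrap the
# model's actual generation call.
reply = guarded_generate("Explain how SVD works.", lambda p: f"[model answer to: {p}]")
print(reply)
```

A real deployment would use a trained safety classifier rather than substring matching, but the wrapping pattern, filter first, generate second, stays the same.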