Name: kmseong/llama2_7b_chat-SSFT-AGNEWS-FT-safety-mix-0.1-lr5e-5 API
Brand: Featherless.ai
Price: 10.00 USD
Availability: InStock
Author: kmseong

WaRP-Safety-Llama3_8B_Instruct: Safety-Aligned LLM

Developed by Min-Seong Kim, this model is a fine-tuned version of Llama 3.1 8B Instruct built for stronger safety alignment. It uses the Safety-First Weight space Rotation Process (WaRP), a three-phase training pipeline that preserves the model's ability to refuse harmful requests while improving performance on utility tasks.

Key Capabilities & Training:

Safety Alignment: Uses WaRP's basis construction, importance scoring, and incremental learning with gradient masking to protect safety mechanisms during later fine-tuning.
Balanced Performance: Preserves safety behavior while improving reasoning ability through fine-tuning on utility tasks such as GSM8K.
Robust Refusal: Designed to maintain strong refusal capabilities for potentially harmful requests, making it suitable for sensitive applications.
Dataset Utilization: Trained using safety data from LibrAI/do-not-answer and utility data from openai/gsm8k.
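
The importance-scoring and gradient-masking steps named above can be illustrated with a toy sketch. This is not the author's WaRP implementation: it omits the basis-construction (rotation) phase entirely, and the scoring rule (|weight x safety gradient|), the protect_fraction parameter, and all function names are assumptions chosen for illustration only.

```python
from typing import List


def importance_scores(weights: List[float], safety_grads: List[float]) -> List[float]:
    """Assumed scoring rule: |w * g| per coordinate, with g measured on safety data."""
    return [abs(w * g) for w, g in zip(weights, safety_grads)]


def build_mask(scores: List[float], protect_fraction: float) -> List[int]:
    """Return 0 for protected (frozen) coordinates and 1 for trainable ones."""
    n_protect = int(round(len(scores) * protect_fraction))
    # Rank coordinates from most to least important and freeze the top ones.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    protected = set(ranked[:n_protect])
    return [0 if i in protected else 1 for i in range(len(scores))]


def masked_sgd_step(
    weights: List[float], utility_grads: List[float], mask: List[int], lr: float
) -> List[float]:
    """Plain SGD step on utility data; masked coordinates receive no update."""
    return [w - lr * m * g for w, g, m in zip(weights, utility_grads, mask)]


# Toy values (hypothetical, not taken from the model).
weights = [0.5, -1.0, 0.25, 2.0]
safety_grads = [0.1, 0.8, 0.0, 0.5]
utility_grads = [1.0, 1.0, 1.0, 1.0]

scores = importance_scores(weights, safety_grads)
mask = build_mask(scores, protect_fraction=0.5)
updated = masked_sgd_step(weights, utility_grads, mask, lr=0.5)
```

In this toy run the two coordinates with the highest safety importance stay fixed while the other two move with the utility gradient, which is the general idea behind protecting safety-relevant directions during utility fine-tuning.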

When to Use This Model:

Safety-Critical Applications: Ideal for use cases where robust safety and refusal of harmful content are paramount.
Balanced Performance Needs: When seeking an LLM that offers both improved utility on reasoning tasks and strong safety mechanisms.
Research in Safety Alignment: Useful for researchers exploring advanced safety alignment techniques like Weight space Rotation Process (WaRP).
