kmseong/llama2_7b-chat-WaRP_only_prompt_lr5e-5
The kmseong/llama2_7b-chat-WaRP_only_prompt_lr5e-5 model is a Llama 3.1 8B Instruct variant, fine-tuned by Min-Seong Kim using the Safety-First Weight space Rotation Process (WaRP). The model is designed for safety alignment: it maintains refusal capabilities for harmful requests while improving utility on reasoning tasks such as GSM8K. It achieves a balanced safety-utility tradeoff through a three-phase training pipeline that protects safety mechanisms via gradient masking.
Overview
This model, kmseong/llama2_7b-chat-WaRP_only_prompt_lr5e-5, is a safety-aligned fine-tune of the meta-llama/Llama-3.1-8B-Instruct base model, developed by Min-Seong Kim. It utilizes a novel Safety-First Weight space Rotation Process (WaRP), a three-phase training pipeline designed to enhance safety without significantly compromising utility.
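The checkpoint can be loaded like any other causal-LM repository on the Hugging Face Hub. A minimal loading sketch (the `load_model` helper is illustrative, not an official snippet from the authors; it assumes standard `transformers` usage):

```python
# Illustrative loading sketch; MODEL_ID comes from this card, everything else
# is generic Hugging Face usage, not code published by the model authors.
MODEL_ID = "kmseong/llama2_7b-chat-WaRP_only_prompt_lr5e-5"

def load_model(device_map: str = "auto"):
    # Imports kept local so the sketch can be read without transformers installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID, device_map=device_map)
    return tokenizer, model
```

Generation then follows the usual `tokenizer`/`model.generate` pattern for instruct models.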
Key Capabilities
- Enhanced Safety Alignment: Specifically fine-tuned to maintain refusal capabilities for harmful requests.
- Utility Preservation: Improves performance on reasoning tasks, demonstrated with the GSM8K dataset, while safeguarding safety mechanisms.
- WaRP Training Method: Employs a unique process involving basis construction, importance scoring, and incremental learning with gradient masking to protect critical safety directions.
- Balanced Tradeoff: Achieves an optimized balance between safety and general utility, making it suitable for applications requiring robust content moderation.
Training Details
The model's training involved:
- Phase 1 (Basis Construction): identifying important neurons in FFN layers using safety data (LibrAI/do-not-answer).
- Phase 2 (Importance Scoring): calculating gradient-based importance scores and generating masks that flag safety-critical directions.
- Phase 3 (Incremental Learning): fine-tuning on utility tasks (openai/gsm8k) with gradient masking, so utility improves while the masked safety directions stay fixed.
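The scoring-and-masking idea behind Phases 2 and 3 can be sketched on a toy layer. This is a minimal illustration, not the authors' implementation: the importance score, the 25% protection threshold, and the placeholder losses are all assumptions made for the example; WaRP's actual basis construction on Llama FFN layers is more involved.

```python
import torch

torch.manual_seed(0)

# Toy stand-in for one FFN weight matrix (WaRP operates on real Llama FFN layers).
ffn = torch.nn.Linear(8, 8, bias=False)

# --- Phases 1-2 (sketch): gradient-based importance from "safety" data ---
safety_x = torch.randn(16, 8)
safety_loss = ffn(safety_x).pow(2).mean()          # placeholder safety loss
safety_loss.backward()
importance = (ffn.weight.grad * ffn.weight).abs()  # first-order importance score
ffn.zero_grad()

# Protect the top 25% most safety-critical entries (threshold is illustrative).
k = importance.numel() // 4
thresh = importance.flatten().topk(k).values.min()
mask = (importance < thresh).float()               # 1 = trainable, 0 = protected
protected_before = ffn.weight.detach().clone()[mask == 0]

# --- Phase 3 (sketch): utility fine-tuning with gradient masking ---
opt = torch.optim.SGD(ffn.parameters(), lr=5e-2)
utility_x, utility_y = torch.randn(16, 8), torch.randn(16, 8)
for _ in range(10):
    opt.zero_grad()
    torch.nn.functional.mse_loss(ffn(utility_x), utility_y).backward()
    ffn.weight.grad.mul_(mask)                     # zero grads on protected entries
    opt.step()

protected_after = ffn.weight.detach()[mask == 0]
# Masked weights never received a gradient update, so they are bit-identical.
assert torch.equal(protected_before, protected_after)
```

The key design point the sketch captures is that masking is applied to gradients, not to weights after the fact, so the protected safety directions are never perturbed during utility fine-tuning.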
Good For
- Applications requiring a strong emphasis on safety and refusal of harmful content.
- Use cases where a balanced safety-utility tradeoff is crucial.
- Developers looking for a Llama 3.1 8B Instruct variant with improved alignment against undesirable outputs.