kmseong/llama2_7b-SSFT-WaRP_medqa_FT_lr3e-5-2

Text Generation · Concurrency Cost: 1 · Model Size: 7B · Quant: FP8 · Context Length: 4k · Published: Apr 30, 2026 · License: llama3.1 · Architecture: Transformer

The kmseong/llama2_7b-SSFT-WaRP_medqa_FT_lr3e-5-2 model is a Llama 3.1 8B Instruct model fine-tuned for safety alignment using the Weight space Rotation Process (WaRP). Developed by Min-Seong Kim, the model aims to maintain refusal capabilities for harmful requests while improving utility on reasoning tasks, balancing safety and performance for applications that require robust safety mechanisms.
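The card does not ship usage code, so below is a minimal inference sketch using Hugging Face transformers. It assumes the checkpoint is hosted on the Hub under the id above and uses the standard chat template bundled with the tokenizer; the helper names (`build_chat`, `generate_response`) are illustrative, not part of the model's API.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "kmseong/llama2_7b-SSFT-WaRP_medqa_FT_lr3e-5-2"

def build_chat(user_msg):
    """Assemble a single-turn conversation in the messages format
    consumed by tokenizer.apply_chat_template."""
    return [
        {"role": "system", "content": "You are a helpful, safety-aligned assistant."},
        {"role": "user", "content": user_msg},
    ]

def generate_response(user_msg, max_new_tokens=256):
    # Note: downloads the full 7B checkpoint on first call.
    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID, device_map="auto", torch_dtype="auto"
    )
    inputs = tokenizer.apply_chat_template(
        build_chat(user_msg), add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)
    out = model.generate(inputs, max_new_tokens=max_new_tokens)
    # Decode only the newly generated tokens, not the prompt.
    return tokenizer.decode(out[0][inputs.shape[-1]:], skip_special_tokens=True)

# Example (not run at import time):
# generate_response("What are common contraindications for aspirin?")
```

Given the model's safety focus, harmful prompts should produce refusals while benign reasoning queries are answered normally.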


Model Overview

The kmseong/llama2_7b-SSFT-WaRP_medqa_FT_lr3e-5-2 model is a Llama 3.1 8B Instruct variant, developed by Min-Seong Kim, that has undergone a specialized fine-tuning process called Safety-First WaRP (Weight space Rotation Process). This three-phase pipeline aims to enhance safety alignment while preserving and improving utility on general tasks.

Key Capabilities

  • Enhanced Safety Alignment: Utilizes a novel WaRP method to protect safety mechanisms through gradient masking, ensuring robust refusal capabilities for harmful queries.
  • Balanced Safety-Utility Tradeoff: Designed to improve performance on reasoning tasks (e.g., GSM8K) without compromising safety features.
  • Targeted Fine-tuning: The training procedure involves constructing a basis from safety data, scoring neuron importance, and incrementally learning utility tasks while protecting critical safety directions.

Training Details

The model was trained using a three-phase approach:

  1. Basis Construction: Identified important neurons in FFN layers using safety data (LibrAI/do-not-answer) and Singular Value Decomposition (SVD).
  2. Importance Scoring: Calculated gradient-based importance scores to generate masks for these critical directions.
  3. Incremental Learning: Fine-tuned on utility data (openai/gsm8k) with gradient masking to improve performance while preserving the identified safety mechanisms.
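The three phases above can be sketched end to end on a toy weight matrix. This is an illustrative NumPy reconstruction of the general idea (an SVD basis from safety activations, gradient-based importance scoring, and masked utility updates), not the author's actual implementation; all sizes and data are synthetic stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_safety, k = 16, 64, 4   # toy sizes: hidden dim, safety samples, protected directions

# Toy weight standing in for one FFN projection matrix.
W = rng.normal(size=(d, d))

# --- Phase 1: basis construction ---
# SVD of activations gathered on safety prompts; the right singular
# vectors give principal directions of the safety subspace.
safety_acts = rng.normal(size=(n_safety, d))        # stand-in for do-not-answer activations
_, _, Vt = np.linalg.svd(safety_acts, full_matrices=False)

# --- Phase 2: importance scoring ---
# Score each direction by the magnitude of a (toy) safety-loss gradient
# projected onto it, then keep the top-k as the protected basis.
safety_grad = rng.normal(size=(d, d))               # stand-in for dL_safety/dW
scores = np.linalg.norm(Vt @ safety_grad, axis=1)   # per-direction importance
protected = Vt[np.argsort(scores)[-k:]]             # (k, d) protected basis

# --- Phase 3: incremental learning with gradient masking ---
# Project utility-task gradients off the protected subspace before the
# update, so weights cannot move along safety-critical directions.
def masked_step(W, grad, basis, lr=1e-2):
    grad = grad - basis.T @ (basis @ grad)  # remove protected components
    return W - lr * grad

utility_grad = rng.normal(size=(d, d))              # stand-in for dL_utility/dW
W_new = masked_step(W, utility_grad, protected)

# The weight change has no component along the protected safety directions.
print(np.abs(protected @ (W_new - W)).max())        # ~0 up to float error
```

The key invariant is checked in the last line: because the protected basis rows are orthonormal, projecting them out of the gradient guarantees the update is exactly orthogonal to the safety subspace.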

Good For

  • Applications requiring a strong emphasis on safety and refusal of harmful content.
  • Use cases where a balance between safety and general reasoning utility is crucial.
  • Developers looking for a model built on Llama 3.1 8B Instruct with enhanced alignment against unsafe outputs.