kmseong/llama2_7b_chat-SSFT-MEDQA-FT-lr3e-5

Text generation · Concurrency cost: 1 · Model size: 7B · Quantization: FP8 · Context length: 4k · Published: Apr 30, 2026 · License: llama3.1 · Architecture: Transformer

The kmseong/llama2_7b_chat-SSFT-MEDQA-FT-lr3e-5 model is a Llama 3.1 8B Instruct-based language model fine-tuned by kmseong for safety alignment using a Weight space Rotation Process (WaRP). The model balances safety and utility: it maintains refusal behavior on harmful requests while improving performance on reasoning tasks. Its safety mechanisms are reinforced through gradient masking and incremental learning.


Model Overview

This model, kmseong/llama2_7b_chat-SSFT-MEDQA-FT-lr3e-5, is a Llama 3.1 8B Instruct variant fine-tuned by kmseong using a Weight space Rotation Process (WaRP), a three-phase training pipeline. Its primary goal is to achieve robust safety alignment while preserving, and where possible improving, utility on reasoning tasks.

Key Capabilities

  • Enhanced Safety: Utilizes a "Safety-First WaRP" method to protect safety mechanisms through gradient masking, ensuring refusal capability for harmful requests.
  • Improved Utility: Despite its safety focus, the model improves performance on utility tasks, particularly reasoning benchmarks such as GSM8K.
  • Balanced Trade-off: Designed to achieve a balanced trade-off between safety and utility, prioritizing safety without significantly degrading task performance.

Training Methodology

The model's unique training involves three phases:

  1. Basis Construction: Identifying important neurons in FFN layers using SVD on safety data.
  2. Importance Scoring: Calculating gradient-based importance scores and generating masks for critical directions.
  3. Incremental Learning: Fine-tuning on utility tasks (e.g., GSM8K) with gradient masking to protect identified safety-critical directions.
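The three phases above can be sketched on a toy single layer. This is a minimal illustrative sketch, not the released training code: the stand-in safety objective, the choice of `k` protected directions, and the SGD settings are all assumptions made for demonstration.

```python
import torch

torch.manual_seed(0)
ffn = torch.nn.Linear(16, 16, bias=False)   # stand-in for one FFN layer
w0 = ffn.weight.detach().clone()            # snapshot before fine-tuning

# Phase 1 -- basis construction: SVD over activations on "safety" data
safety_x = torch.randn(64, 16)
with torch.no_grad():
    acts = ffn(safety_x)
U, _, _ = torch.linalg.svd(acts.T @ acts)   # orthonormal basis over output directions

# Phase 2 -- importance scoring: project a safety-loss gradient onto the basis
loss = ffn(safety_x).pow(2).mean()          # assumed stand-in safety objective
loss.backward()
scores = (U.T @ ffn.weight.grad).norm(dim=1)
k = 4                                       # assumed number of protected directions
protected = U[:, scores.topk(k).indices]    # safety-critical subspace (16 x k)

# Phase 3 -- incremental learning on a utility task with gradient masking
opt = torch.optim.SGD(ffn.parameters(), lr=1e-2)
utility_x, utility_y = torch.randn(64, 16), torch.randn(64, 16)
for _ in range(20):
    opt.zero_grad()
    torch.nn.functional.mse_loss(ffn(utility_x), utility_y).backward()
    g = ffn.weight.grad
    g -= protected @ (protected.T @ g)      # zero out safety-critical components
    opt.step()

delta = ffn.weight.detach() - w0
print("update norm:", delta.norm().item())
print("leak into protected subspace:", (protected.T @ delta).norm().item())
```

Because every masked gradient is orthogonal to the protected subspace, the accumulated weight change has (numerically) zero projection onto the safety-critical directions, while the rest of the weights move freely for the utility task.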

Good For

  • Applications requiring a safety-aligned LLM that can effectively refuse harmful prompts.
  • Use cases where a balance between safety and general utility is crucial.
  • Developers interested in models trained with advanced safety alignment techniques like WaRP.