kmseong/llama2_7b_chat-SSFT-MMLU-FT-SafeInstr-0.1-lr3e-5

TEXT GENERATION · Concurrency Cost: 1 · Model Size: 7B · Quant: FP8 · Ctx Length: 4k · Published: Apr 30, 2026 · License: llama3.1 · Architecture: Transformer

The kmseong/llama2_7b_chat-SSFT-MMLU-FT-SafeInstr-0.1-lr3e-5 model is a Llama 3.1 8B Instruct base model fine-tuned for safety alignment using the Weight space Rotation Process (WaRP). The model is trained to keep refusing harmful requests while improving utility on reasoning tasks, balancing safety and performance for applications that require robust safety behavior.


Overview

This model, kmseong/llama2_7b_chat-SSFT-MMLU-FT-SafeInstr-0.1-lr3e-5, is a safety-aligned version of the Llama 3.1 8B Instruct base model. It was fine-tuned using the Weight space Rotation Process (WaRP), a three-phase pipeline designed to enhance safety without significantly compromising utility. Training identifies and protects important neuron directions related to safety, so the model retains its refusal behavior for harmful content.

Key Capabilities

  • Enhanced Safety Alignment: Utilizes a unique WaRP method to protect safety mechanisms through gradient masking.
  • Harmful Request Refusal: Maintains strong refusal capabilities when confronted with harmful prompts.
  • Improved Utility: Demonstrates improved performance on reasoning tasks, specifically fine-tuned using the GSM8K dataset.
  • Balanced Performance: Achieves a balance between safety and general utility, addressing a common challenge in LLM alignment.
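The gradient masking mentioned above can be illustrated in miniature: safety-critical parameters are frozen by zeroing their gradient entries before each update, so utility fine-tuning cannot move them. This is a hypothetical numpy sketch, not the model's actual training code; the mask values and learning rate are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
weights = rng.normal(size=6)   # toy parameter vector
grad = rng.normal(size=6)      # gradient from a utility-task loss

# 1 = trainable, 0 = protected. In WaRP, importance scores derived from
# safety data would decide which entries are protected.
mask = np.array([1.0, 1.0, 0.0, 1.0, 0.0, 1.0])

# Masked SGD step: protected entries receive a zero gradient, so they
# are left exactly unchanged while the rest of the model adapts.
updated = weights - 0.1 * (grad * mask)

print(updated[2] == weights[2], updated[4] == weights[4])  # True True
```

The same idea scales to full weight matrices: the mask is applied elementwise to each layer's gradient before the optimizer step.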

Training Details

The WaRP training procedure involved three phases:

  1. Basis Construction: Collected activations from FFN layers using safety data (LibrAI/do-not-answer) and computed SVD to identify important neurons.
  2. Importance Scoring: Calculated importance scores using gradient-based methods to generate masks for these important directions.
  3. Incremental Learning: Fine-tuned on utility tasks (openai/gsm8k) with gradient masking to improve utility while preserving the established safety mechanisms.
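The three phases above can be sketched end to end on a toy weight matrix. This is a minimal illustration under stated assumptions, not the published training code: singular values stand in for the gradient-based importance scores of phase 2, random vectors stand in for activations from the safety and utility datasets, and the "mask" is implemented as a projection of the gradient off the protected subspace.

```python
import numpy as np

rng = np.random.default_rng(0)

d_in, d_out = 8, 6
W = rng.normal(size=(d_out, d_in))  # toy stand-in for one FFN weight matrix

# Phase 1: Basis construction. Collect activations on safety prompts
# (stand-in for LibrAI/do-not-answer) and take the SVD; the right
# singular vectors span the directions the safety data excites.
safety_acts = rng.normal(size=(32, d_in))
_, S, Vt = np.linalg.svd(safety_acts, full_matrices=False)

# Phase 2: Importance scoring. Here singular values act as a proxy
# score; keep the top-k directions as the protected safety basis.
k = 3
protected = Vt[:k]  # (k, d_in), rows are orthonormal directions

# Phase 3: Incremental learning with gradient masking. Project the
# utility-task gradient (stand-in for GSM8K fine-tuning) off the
# protected subspace before each update, so weights never move along
# the safety-critical directions.
def masked_update(W, grad, protected, lr=0.1):
    grad_masked = grad - (grad @ protected.T) @ protected
    return W - lr * grad_masked

utility_grad = rng.normal(size=(d_out, d_in))
W_new = masked_update(W, utility_grad, protected)

# The resulting weight change is orthogonal to every protected direction.
delta = W_new - W
print(np.abs(delta @ protected.T).max())  # ~0 (floating-point noise)
```

The design point this illustrates: utility fine-tuning still has the full unprotected subspace to learn in, which is how the method aims to improve GSM8K-style reasoning without eroding the refusal mechanisms established on safety data.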

Good For

  • Applications requiring a robustly safety-aligned language model.
  • Use cases where maintaining refusal capabilities for harmful content is critical.
  • Scenarios needing a model that balances safety with reasoning performance.