Name: kmseong/llama2_7b_chat-WaRP-circuit-breaker-gsm8k-lr5e-5 API
Brand: Featherless.ai
Price: 10.00 USD
Availability: InStock
Author: kmseong

Model Overview

This model, kmseong/llama2_7b_chat-WaRP-circuit-breaker-gsm8k-lr5e-5, is listed as a 7-billion-parameter model fine-tuned from the meta-llama/Llama-3.1-8B-Instruct base model, although its name points to a Llama 2 7B chat checkpoint. Developed by Min-Seong Kim, it applies Safety-First WaRP (Weight space Rotation Process), a three-phase pipeline designed to protect safety alignment in large language models during fine-tuning.

Key Capabilities

Safety Alignment: Uses the WaRP method to construct basis vectors from safety data, identify important neurons, and apply gradient masking during fine-tuning so that safety mechanisms are protected.
Refusal Capability: Retains the ability to refuse harmful requests after fine-tuning.
Improved Utility: Fine-tuned on the openai/gsm8k dataset to improve mathematical problem-solving while still prioritizing safety.
Balanced Trade-off: Balances safety and utility by improving reasoning performance without degrading the protected safety features.

Training Details

The training involved three distinct phases:

Basis Construction: Collected activations from FFN layers using safety data (LibrAI/do-not-answer) to derive orthonormal basis vectors.
Importance Scoring: Calculated gradient-based importance scores to generate masks for critical directions.
Incremental Learning: Fine-tuned on utility tasks (GSM8K) with gradient masking to protect identified important directions, thereby preserving safety while enhancing utility.
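
The three phases above can be illustrated with a minimal NumPy sketch for a single weight matrix. This is not the author's implementation: the function names, the use of SVD to obtain the basis, the sum-of-absolute-gradients score, and the keep_ratio parameter are all assumptions made for illustration, and random arrays stand in for real activations and gradients.

```python
import numpy as np

def build_basis(activations):
    """Phase 1 (sketch): orthonormal basis from activations of shape (n, d).

    Assumption: the basis is taken from the right singular vectors.
    """
    _, _, vt = np.linalg.svd(activations, full_matrices=True)
    return vt.T  # columns are orthonormal basis vectors, shape (d, d)

def importance_mask(safety_grad, basis, keep_ratio=0.25):
    """Phase 2 (sketch): mark the highest-scoring basis directions as protected.

    Assumption: a direction's score is the summed absolute gradient
    after expressing the gradient in the basis coordinates.
    """
    rotated = safety_grad @ basis              # gradient in basis coordinates
    scores = np.abs(rotated).sum(axis=0)       # one score per direction
    k = max(1, int(round(keep_ratio * scores.size)))
    protected = np.zeros(scores.size, dtype=bool)
    protected[np.argsort(scores)[-k:]] = True  # top-k directions
    return protected

def masked_update(w, utility_grad, basis, protected, lr=5e-5):
    """Phase 3 (sketch): one gradient step that skips protected directions."""
    rotated = utility_grad @ basis
    rotated[:, protected] = 0.0                # gradient masking
    return w - lr * (rotated @ basis.T)        # rotate back and apply the step

# Toy usage with random stand-ins for activations and gradients.
rng = np.random.default_rng(0)
basis = build_basis(rng.normal(size=(64, 8)))
protected = importance_mask(rng.normal(size=(4, 8)), basis)
w = rng.normal(size=(4, 8))
w_new = masked_update(w, rng.normal(size=(4, 8)), basis, protected)
```

Because the basis is orthonormal, the components of the weights along the protected directions are identical before and after the update, which is the sense in which masking "protects" them in this sketch.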

Good For

Applications requiring a strong emphasis on safety and refusal of harmful content.
Use cases where mathematical reasoning and problem-solving are important, alongside safety.
Developers looking for a model that explicitly addresses the safety-utility trade-off through a structured alignment process.
