kmseong/llama3_2_3b-instruct-WaRP_lr3e-5

TEXT GENERATION · Concurrency cost: 1 · Model size: 3.2B · Quant: BF16 · Context length: 32k · Published: Apr 28, 2026 · License: llama3.1 · Architecture: Transformer

kmseong/llama3_2_3b-instruct-WaRP_lr3e-5 is a 3.2-billion-parameter instruction-tuned language model, fine-tuned from Llama 3.2 3B Instruct using the Safety-First Weight space Rotation Process (WaRP). WaRP targets safety alignment by protecting important directions in weight space while improving utility: the model is designed to retain its refusal behavior on harmful requests while gaining performance on reasoning tasks, offering a balanced safety-utility tradeoff.


Model Overview

This model, kmseong/llama3_2_3b-instruct-WaRP_lr3e-5, is a safety-aligned, instruction-tuned variant of the meta-llama/Llama-3.2-3B-Instruct base model. It leverages the Safety-First Weight space Rotation Process (WaRP), a three-phase training pipeline designed to enhance safety without significantly compromising utility.
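The model can be loaded like any other chat-tuned checkpoint with Hugging Face `transformers`. The sketch below is a generic usage example, not an official snippet from this model card; the prompt, generation settings, and the `generate` helper name are illustrative:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "kmseong/llama3_2_3b-instruct-WaRP_lr3e-5"


def generate(prompt: str, max_new_tokens: int = 256) -> str:
    """Load the model, apply its chat template, and return the reply text."""
    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID,
        torch_dtype="bfloat16",  # the card lists BF16 quantization
        device_map="auto",
    )
    messages = [{"role": "user", "content": prompt}]
    input_ids = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)
    output = model.generate(input_ids, max_new_tokens=max_new_tokens)
    # Strip the prompt tokens and decode only the newly generated reply.
    return tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True)


if __name__ == "__main__":
    print(generate("A train travels 60 km in 45 minutes. What is its speed in km/h?"))
```

Because the checkpoint is safety-aligned, harmful prompts should produce refusals rather than completions; benign reasoning prompts like the one above should be answered normally.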

Key Capabilities & Features

  • Enhanced Safety Alignment: Uses the WaRP method to construct an orthonormal basis from safety data, identify the neurons important for refusal behavior, and apply gradient masking during fine-tuning so those directions are preserved.
  • Balanced Safety-Utility Tradeoff: Specifically engineered to protect safety mechanisms, ensuring robust refusal capabilities for harmful requests, while simultaneously improving performance on utility tasks like reasoning (e.g., GSM8K).
  • Targeted Fine-tuning: The training procedure involves collecting activations from FFN layers using safety data (LibrAI/do-not-answer), computing SVD for basis vectors, and then incrementally learning on utility data (openai/gsm8k) with gradient masking to preserve safety.
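The basis-construction and gradient-masking steps above can be sketched in a toy NumPy example. All arrays here are random stand-ins, and the variable names are hypothetical; this illustrates the projection-based masking idea, not the authors' exact WaRP implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins (assumptions): activations collected from one FFN layer on
# safety prompts, and a gradient of that layer's weights from a utility step.
hidden = 16
safety_acts = rng.normal(size=(200, hidden))      # [num_tokens, hidden]
utility_grad = rng.normal(size=(hidden, hidden))  # [hidden, hidden]

# 1) SVD of the safety activations: the rows of vt form an orthonormal basis
#    ordered by how much safety-data variance each direction captures.
_, s, vt = np.linalg.svd(safety_acts, full_matrices=False)

# 2) Keep the top-k directions as the protected "safety subspace".
k = 4
protected = vt[:k]  # [k, hidden], orthonormal rows

# 3) Gradient masking: subtract the gradient's projection onto the protected
#    subspace, so utility fine-tuning cannot move weights along it.
masked_grad = utility_grad - utility_grad @ protected.T @ protected

# The masked gradient is (numerically) orthogonal to every protected direction.
print(np.abs(masked_grad @ protected.T).max())
```

The design intuition: updates orthogonal to the safety basis can still improve utility-task loss, while the directions that encode refusal behavior are frozen, which is what yields the safety-utility tradeoff the card describes.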

When to Use This Model

This model is particularly well-suited for applications where:

  • Safety is paramount: Its core design prioritizes maintaining refusal capabilities against harmful prompts.
  • Reasoning tasks are important: It demonstrates improved utility on tasks requiring logical inference.
  • A balance between safety and performance is desired: The WaRP method aims to optimize both aspects concurrently.

Users should still perform their own evaluations and implement additional safety measures as needed for specific use cases.