kmseong/llama-2-13b_WaRP-cb_alpha5_layers10-20_lr1e-4-lr5e-5
The kmseong/llama-2-13b_WaRP-cb_alpha5_layers10-20_lr1e-4-lr5e-5 model is a 13-billion-parameter language model based on Llama 2, fine-tuned with a Weight space Rotation Process (WaRP) for safety alignment. It aims to keep refusing harmful requests while improving performance on reasoning tasks, and is suited to applications that need strong safety behavior without a significant loss of general utility.
WaRP-Safety-Llama3_8B_Instruct: Safety-Aligned LLM
This model, developed by kmseong, is a fine-tuned version of the Llama 3.1 8B Instruct base model, aligned for safety with a Weight space Rotation Process (WaRP). WaRP uses a three-phase pipeline to balance safety against utility.
Key Capabilities & Training:
- Safety-First WaRP Pipeline: Utilizes a three-phase process involving Basis Construction, Importance Scoring, and Incremental Learning.
- Gradient Masking: Protects critical safety mechanisms by identifying and preserving important neuronal directions during fine-tuning.
- Refusal Capability: Designed to maintain strong refusal capabilities for harmful or unsafe requests.
- Improved Utility: While prioritizing safety, the model is also fine-tuned on the GSM8K dataset to improve utility on reasoning tasks.
- Balanced Performance: Aims to provide a robust balance between safety and general task performance.
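The importance scoring and gradient masking steps listed above can be illustrated with a toy sketch. It is not the author's WaRP implementation: it assumes the weights are already expressed as coefficients in the constructed basis (the rotation itself is omitted), it assumes importance is the mean absolute gradient over safety batches, and the function names and the `keep_ratio` parameter are hypothetical.

```python
# Toy illustration of importance scoring + gradient masking.
# Assumptions (not from the model card): coefficients are already in the
# rotated basis, importance = mean |gradient| on safety data, and
# keep_ratio is a made-up hyperparameter.

def importance_scores(safety_grads):
    """Mean absolute gradient per basis direction across safety batches."""
    n_batches = len(safety_grads)
    n_dirs = len(safety_grads[0])
    return [sum(abs(g[i]) for g in safety_grads) / n_batches for i in range(n_dirs)]

def build_mask(scores, keep_ratio):
    """0.0 for the most important directions (frozen), 1.0 for the rest."""
    n_frozen = round(len(scores) * keep_ratio)
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    frozen = set(ranked[:n_frozen])
    return [0.0 if i in frozen else 1.0 for i in range(len(scores))]

def masked_sgd_step(coeffs, grads, mask, lr):
    """One SGD step in which masked (frozen) directions receive no update."""
    return [c - lr * m * g for c, g, m in zip(coeffs, grads, mask)]

# Gradients from two hypothetical safety batches over four directions.
safety_grads = [[0.9, 0.1, 0.0, 0.5], [1.1, 0.0, 0.2, 0.7]]
mask = build_mask(importance_scores(safety_grads), keep_ratio=0.5)

# A utility (e.g. reasoning) update only moves the unfrozen directions.
coeffs = [1.0, 2.0, 3.0, 4.0]
utility_grads = [0.5, 0.5, 0.5, 0.5]
updated = masked_sgd_step(coeffs, utility_grads, mask, lr=0.1)
```

In this toy run, directions 0 and 3 carry the largest safety gradients, so they are frozen and the utility update changes only directions 1 and 2.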
Use Cases:
- Applications requiring enhanced safety and ethical AI responses.
- Scenarios where maintaining refusal for harmful content is critical.
- Tasks benefiting from a reasoning-capable model with strong safety guardrails.
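For the use cases above, a minimal loading sketch with the Hugging Face `transformers` library might look like the following. It assumes the repository hosts standard causal-LM weights and a tokenizer, and that `accelerate` is installed for `device_map="auto"`; the `generate` helper and its plain-text prompt handling are illustrative, not a documented interface of this model.

```python
MODEL_ID = "kmseong/llama-2-13b_WaRP-cb_alpha5_layers10-20_lr1e-4-lr5e-5"

def generate(prompt, max_new_tokens=128):
    """Load the model and return the decoded text (prompt plus completion)."""
    # Imported lazily so the heavy dependencies are only needed when called.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID,
        torch_dtype="auto",   # use the dtype stored in the checkpoint
        device_map="auto",    # assumption: accelerate is available
    )
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

Calling `generate("What is 12 * 7?")` downloads the weights on first use, so a 13B-parameter checkpoint needs substantial disk space and memory.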
This model was trained on safety data from LibrAI/do-not-answer and utility data from openai/gsm8k, so both sides of the safety-utility tradeoff are represented in training.
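To inspect the two training sources named here, a sketch using the Hugging Face `datasets` library could look like this. The `"main"` configuration for GSM8K and the `train` splits are assumptions about how the data was consumed; the model card does not state which splits or preprocessing were used, and `load_training_sources` is a hypothetical helper.

```python
SAFETY_DATASET = "LibrAI/do-not-answer"
UTILITY_DATASET = ("openai/gsm8k", "main")  # (repository, configuration)

def load_training_sources():
    """Return (safety, utility) datasets; splits are assumed, not documented."""
    # Imported lazily so the dependency is only needed when called.
    from datasets import load_dataset

    safety = load_dataset(SAFETY_DATASET, split="train")
    utility = load_dataset(*UTILITY_DATASET, split="train")
    return safety, utility
```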