kmseong/llama3.1-8b-base-warp-gsm8k-lr1e-5
The kmseong/llama3.1-8b-base-warp-gsm8k-lr1e-5 is an 8 billion parameter Llama 3.1 Instruct model, fine-tuned by kmseong using the Weight space Rotation Process (WaRP) for safety alignment. This model uniquely balances safety mechanisms with utility, specifically improving performance on reasoning tasks like GSM8K while maintaining refusal capabilities for harmful requests. It features a 3-phase training pipeline that protects safety-critical directions during incremental learning. This model is optimized for applications requiring robust safety alongside strong mathematical and reasoning abilities.
Loading preview...
WaRP-Safety-Llama3_8B_Instruct: Safety-Aligned Llama 3.1
This model, developed by kmseong, is an 8 billion parameter Llama 3.1 Instruct variant specifically fine-tuned for enhanced safety alignment using a novel Weight space Rotation Process (WaRP). It addresses the critical balance between model utility and safety, ensuring robust performance while mitigating harmful outputs.
Key Capabilities & Features
- Safety-First WaRP Training: Employs a unique 3-phase pipeline:
- Basis Construction: Identifies important neurons related to safety from FFN layers using SVD on safety data.
- Importance Scoring: Calculates gradient-based importance scores to generate masks for critical directions.
- Incremental Learning: Fine-tunes on utility tasks (like GSM8K) with gradient masking to protect identified safety mechanisms.
- Balanced Safety-Utility Tradeoff: Designed to improve utility on reasoning tasks while preserving refusal capabilities for harmful requests.
- Base Model: Built upon
meta-llama/Llama-3.1-8B-Instruct. - Training Data: Utilizes
LibrAI/do-not-answerfor safety alignment andopenai/gsm8kfor utility improvement.
Good For
- Applications requiring a strong 8B language model with enhanced safety features.
- Use cases where maintaining refusal capabilities for harmful content is paramount.
- Scenarios demanding improved performance on mathematical and reasoning tasks (e.g., GSM8K) without compromising safety.
- Developers looking for a Llama 3.1 variant that has undergone specific safety alignment training.