kmseong/llama2_7b_chat-SSFT-AGNEWS-FT-safeInstr-0.1-lr5e-5

Text generation · Concurrency cost: 1 · Model size: 7B · Quant: FP8 · Ctx length: 4K · Published: Apr 30, 2026 · License: llama3.1 · Architecture: Transformer

The kmseong/llama2_7b_chat-SSFT-AGNEWS-FT-safeInstr-0.1-lr5e-5 model is a fine-tune of Llama 3.1 8B Instruct by kmseong, aligned for safety using the Weight space Rotation Process (WaRP). WaRP is a 3-phase training pipeline that protects the model's safety mechanisms, preserving its refusal behaviour on harmful requests while improving utility on reasoning tasks. The model is designed to balance safety and performance, making it suitable for applications that require robust safety alignment.


WaRP-Safety-Llama3_8B_Instruct: Safety-Aligned Llama 3.1 8B

This model, developed by kmseong, is a fine-tuned version of meta-llama/Llama-3.1-8B-Instruct specifically engineered for enhanced safety alignment. It utilizes a novel Weight space Rotation Process (WaRP), a 3-phase pipeline designed to integrate safety without significantly compromising utility.

Key Capabilities & Training

  • Safety-First WaRP: Employs a unique three-phase training approach:
    • Basis Construction: Identifies important neurons related to safety using SVD on activations from FFN layers.
    • Importance Scoring: Calculates gradient-based importance scores to generate masks for critical safety directions.
    • Incremental Learning: Fine-tunes on utility tasks (like GSM8K) while protecting these important safety directions through gradient masking.
  • Balanced Safety-Utility: Aims to improve performance on reasoning tasks while preserving robust refusal capabilities for harmful requests.
  • Protected Safety Mechanisms: Ensures that the model maintains its ability to identify and refuse unsafe prompts.
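The three phases above can be sketched in miniature. This is an illustrative NumPy toy, not the released training code: the weight matrix, activations, and top-k scoring rule are stand-ins, and real WaRP operates on FFN layers of the full model. The core idea it demonstrates is the same, though: build a basis with SVD, mark some directions as safety-critical, and zero the gradient components along those directions before each utility-task update.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for one FFN weight matrix and a batch of its input activations.
d_in, d_out, n_samples = 16, 8, 64
W = rng.normal(size=(d_out, d_in))
acts = rng.normal(size=(n_samples, d_in))  # activations collected on safety data

# Phase 1 -- Basis construction: SVD of the activation matrix yields an
# orthonormal basis (rows of Vt) ordered by how strongly each direction
# is expressed in the safety activations.
_, S, Vt = np.linalg.svd(acts, full_matrices=False)

# Phase 2 -- Importance scoring: as a simple proxy, treat the top-k singular
# directions as "safety-critical" and build a mask over them. (The actual
# method uses gradient-based importance scores.)
k = 4
mask = np.zeros(d_in)
mask[:k] = 1.0  # 1 = protected safety direction, 0 = free to update

# Phase 3 -- Incremental learning with gradient masking: express the utility
# gradient in the SVD basis, zero its components along protected directions,
# then map it back before applying the update.
grad = rng.normal(size=(d_out, d_in))   # gradient from a utility-task step
grad_in_basis = grad @ Vt.T             # columns now indexed by basis direction
grad_in_basis *= (1.0 - mask)           # block updates along safety directions
masked_grad = grad_in_basis @ Vt        # back to original coordinates

lr = 1e-2
W_new = W - lr * masked_grad

# The update leaves the protected subspace untouched:
print(np.abs((W_new - W) @ Vt[:k].T).max())  # ~0: no change along safety dirs
```

Because `Vt` is orthonormal, projecting the applied update onto the first `k` basis directions recovers exactly the components that were zeroed, so safety-critical directions are provably unchanged while the remaining directions still learn from the utility gradient.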

Datasets Used

  • Safety Data: LibrAI/do-not-answer
  • Utility Data: openai/gsm8k

Use Cases

This model is particularly well-suited for applications where strong safety alignment is paramount, such as chatbots, content moderation, or any interactive AI system that needs to handle user inputs responsibly while still performing general reasoning tasks effectively. Users should still evaluate outputs and implement additional safety measures as needed.
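As one example of the "additional safety measures" mentioned above, a deployment can screen user input before it ever reaches the model, as a layer on top of the model's own refusal behaviour. The wrapper below is a hypothetical sketch: the blocklist, refusal message, and `generate` callback are placeholders, not part of this model's release.

```python
# Illustrative pre-filter sketch (placeholder blocklist and refusal message).
BLOCKED_TOPICS = ("build a bomb", "synthesize meth")

def guarded_generate(prompt: str, generate) -> str:
    """Call `generate(prompt)` only if the prompt passes a simple screen."""
    lowered = prompt.lower()
    if any(topic in lowered for topic in BLOCKED_TOPICS):
        return "I can't help with that request."
    return generate(prompt)

# Usage with a stand-in generator; in practice `generate` would wrap the
# model's actual generation call.
reply = guarded_generate("Explain how SVD works.", lambda p: f"[model answer to: {p}]")
print(reply)
```

A real deployment would use a trained safety classifier rather than substring matching, but the wrapping pattern, filter first, generate second, stays the same.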