kmseong/llama2_7b_chat-SSFT-MEDQA-FT-lr3e-5

Text generation · Concurrency cost: 1 · Model size: 7B · Quantization: FP8 · Context length: 4k · Published: Apr 30, 2026 · License: llama3.1 · Architecture: Transformer

The kmseong/llama2_7b_chat-SSFT-MEDQA-FT-lr3e-5 model is a Llama 3.1 8B Instruct-based language model fine-tuned by kmseong for safety alignment using a Weight space Rotation Process (WaRP). The model balances safety and utility: it maintains refusal behavior on harmful requests while improving performance on reasoning tasks. Its safety mechanisms are reinforced through gradient masking and incremental learning.


Model Overview

This model, kmseong/llama2_7b_chat-SSFT-MEDQA-FT-lr3e-5, is a Llama 3.1 8B Instruct variant fine-tuned by kmseong using a Weight space Rotation Process (WaRP), a three-phase training pipeline. Its primary goal is to achieve robust safety alignment while preserving, and where possible improving, utility on reasoning tasks.

Key Capabilities

  • Enhanced Safety: Utilizes a "Safety-First WaRP" method to protect safety mechanisms through gradient masking, ensuring refusal capability for harmful requests.
  • Improved Utility: Despite its safety focus, the model improves performance on utility tasks, particularly reasoning benchmarks such as GSM8K.
  • Balanced Trade-off: Designed to achieve a balanced trade-off between safety and utility, prioritizing safety without significantly degrading task performance.

Training Methodology

The model's unique training involves three phases:

  1. Basis Construction: Identifying important neurons in FFN layers using SVD on safety data.
  2. Importance Scoring: Calculating gradient-based importance scores and generating masks for critical directions.
  3. Incremental Learning: Fine-tuning on utility tasks (e.g., GSM8K) with gradient masking to protect identified safety-critical directions.
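The three phases above can be sketched on a toy single layer. This is a minimal illustrative sketch, not the released training code: the stand-in safety objective, the choice of `k` protected directions, and the SGD settings are all assumptions made for demonstration.

```python
import torch

torch.manual_seed(0)
ffn = torch.nn.Linear(16, 16, bias=False)   # stand-in for one FFN layer
w0 = ffn.weight.detach().clone()            # snapshot before fine-tuning

# Phase 1 -- basis construction: SVD over activations on "safety" data
safety_x = torch.randn(64, 16)
with torch.no_grad():
    acts = ffn(safety_x)
U, _, _ = torch.linalg.svd(acts.T @ acts)   # orthonormal basis over output directions

# Phase 2 -- importance scoring: project a safety-loss gradient onto the basis
loss = ffn(safety_x).pow(2).mean()          # assumed stand-in safety objective
loss.backward()
scores = (U.T @ ffn.weight.grad).norm(dim=1)
k = 4                                       # assumed number of protected directions
protected = U[:, scores.topk(k).indices]    # safety-critical subspace (16 x k)

# Phase 3 -- incremental learning on a utility task with gradient masking
opt = torch.optim.SGD(ffn.parameters(), lr=1e-2)
utility_x, utility_y = torch.randn(64, 16), torch.randn(64, 16)
for _ in range(20):
    opt.zero_grad()
    torch.nn.functional.mse_loss(ffn(utility_x), utility_y).backward()
    g = ffn.weight.grad
    g -= protected @ (protected.T @ g)      # zero out safety-critical components
    opt.step()

delta = ffn.weight.detach() - w0
print("update norm:", delta.norm().item())
print("leak into protected subspace:", (protected.T @ delta).norm().item())
```

Because every masked gradient is orthogonal to the protected subspace, the accumulated weight change has (numerically) zero projection onto the safety-critical directions, while the rest of the weights move freely for the utility task.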

Good For

  • Applications requiring a safety-aligned LLM that can effectively refuse harmful prompts.
  • Use cases where a balance between safety and general utility is crucial.
  • Developers interested in models trained with advanced safety alignment techniques like WaRP.