wvnvwn/llama-2-13b-chat-hf-only-rsn-tuned-lr5e-5

Hugging Face
TEXT GENERATIONConcurrency Cost:1Model Size:13BQuant:FP8Ctx Length:4kPublished:May 2, 2026License:apache-2.0Architecture:Transformer Open Weights Warm

The wvnvwn/llama-2-13b-chat-hf-only-rsn-tuned-lr5e-5 is a 13 billion parameter Llama-2-chat-HF-based model, fine-tuned using the Safety Neuron Tuning (SN-Tune) method. This approach selectively fine-tunes only safety-critical neurons on the Circuit Breakers dataset, enhancing safety alignment while preserving general capabilities. It is designed to provide improved safety performance compared to its base model, meta-llama/Llama-3.2-3B-Instruct.

Loading preview...

Overview

This model, wvnvwn/llama-2-13b-chat-hf-only-rsn-tuned-lr5e-5, is a 13 billion parameter language model based on the Llama-2-chat-HF architecture. It has been specifically fine-tuned using a novel method called Safety Neuron Tuning (SN-Tune) to enhance its safety alignment.

Key Capabilities & Features

  • Safety Neuron Tuning (SN-Tune): A selective fine-tuning approach that identifies and fine-tunes only a small set of neurons critical for safety, while freezing all other parameters.
  • Enhanced Safety Alignment: By focusing on safety neurons and training on the Circuit Breakers dataset, the model aims to provide improved safety performance.
  • Preservation of General Capabilities: The SN-Tune method is designed to minimize impact on the model's general language understanding and generation abilities.
  • Parameter-Efficient Fine-tuning: This targeted approach makes the fine-tuning process more efficient.

When to Use This Model

This model is particularly suitable for applications where:

  • Safety is a primary concern: Its SN-Tune methodology makes it a strong candidate for use cases requiring robust safety alignment.
  • Maintaining base model capabilities is important: The selective fine-tuning ensures that the model's core functionalities are largely preserved.
  • Mitigating harmful outputs: It offers an improved defense against generating unsafe or undesirable content compared to its base model.