skysys00/Meta-Llama-3-8B-Instruct-DeepRefusal

TEXT GENERATIONConcurrency Cost:1Model Size:8BQuant:FP8Ctx Length:8kPublished:Apr 8, 2026Architecture:Transformer0.0K Cold

The skysys00/Meta-Llama-3-8B-Instruct-DeepRefusal model is a variant of the Meta-Llama-3-8B-Instruct architecture, developed by YuanBoXie. This model is specifically fine-tuned to enhance its refusal mechanism, aiming to move beyond surface-level alignment in safety. It focuses on probabilistically ablating refusal directions to improve the robustness of safety mechanisms in large language models. This makes it particularly suitable for research into advanced LLM safety and refusal behavior.

Loading preview...

Overview

The skysys00/Meta-Llama-3-8B-Instruct-DeepRefusal model is a specialized iteration of the Meta-Llama-3-8B-Instruct architecture. Developed by YuanBoXie, its core innovation lies in its enhanced safety mechanism, specifically targeting refusal behavior.

Key Capabilities

  • Advanced Refusal Mechanism: This model implements a novel approach to safety, moving "Beyond Surface Alignment" by probabilistically ablating refusal directions. This aims to create a more robust and less easily circumvented safety system compared to standard alignment techniques.
  • Research-Oriented: Based on the Meta-Llama-3-8B-Instruct foundation, it provides a strong base for exploring advanced safety techniques in LLMs.

Good For

  • LLM Safety Research: Ideal for researchers and developers focused on understanding and improving the refusal capabilities and safety mechanisms of large language models.
  • Probing Model Behavior: Useful for analyzing how models respond to harmful or inappropriate prompts, particularly in the context of deep refusal strategies.
  • Developing Robust AI: Contributes to the development of more secure and ethically aligned AI systems by addressing the nuances of refusal behavior.

This model is a direct result of the research presented in the paper "Beyond Surface Alignment: Rebuilding LLMs Safety Mechanism via Probabilistically Ablating Refusal Direction," slated for EMNLP 2025.