Vaxispraxis/Llama-3.1-8B-Instruct-heretic is an 8-billion-parameter instruction-tuned Llama 3.1 model developed by Vaxispraxis. The model has undergone post-training behavioral modification using the Heretic framework to significantly reduce refusal responses and increase permissiveness. It achieves this by manipulating residual streams and subtracting refusal-associated components, making it suitable for use cases requiring less constrained outputs.
Overview of Llama-3.1-8B-Instruct-Heretic
This model is a specialized version of Llama 3.1 8B Instruct, developed by Vaxispraxis, that relies on post-training behavioral modification rather than traditional fine-tuning. Its core innovation is the use of the Heretic framework to reduce refusal responses and increase the directness of outputs.
Key Capabilities and Methodology
Unlike standard fine-tuning, Llama-3.1-8B-Instruct-Heretic employs:
- Residual stream manipulation: Directly altering the model's internal processing.
- Directional vector subtraction (abliteration): Identifying and removing components associated with refusal behaviors.
- KL-divergence constrained optimization: Ensuring that behavioral changes are controlled and do not drastically alter core capabilities.
The optimization ran for 200 trials against a KL-divergence target of 0.01, using datasets such as mlabonne/harmless_alpaca and mlabonne/harmful_behaviors as contrast sets for calibration and evaluation.
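The two core mechanics above can be illustrated with a small numeric sketch. This is not the Heretic implementation; it is a toy model, assuming a single refusal direction computed as the difference of mean activations over harmful and harmless prompt sets, removed by projecting it out of the residual stream, with a KL-divergence check between baseline and modified next-token distributions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy residual-stream activations (hypothetical shapes: n_prompts x d_model).
d_model = 64
harmless = rng.normal(size=(100, d_model))
harmful = rng.normal(size=(100, d_model)) + 2.0 * np.eye(d_model)[0]  # shifted cluster

# 1. Directional vector: normalized difference of mean activations.
direction = harmful.mean(axis=0) - harmless.mean(axis=0)
direction /= np.linalg.norm(direction)

def ablate(acts, d):
    """Subtract each activation's component along the unit vector d."""
    return acts - np.outer(acts @ d, d)

ablated = ablate(harmful, direction)
# After ablation, activations have (near-)zero projection onto the direction.
print(np.abs(ablated @ direction).max())

# 2. KL-divergence between baseline and modified output distributions,
#    the quantity the optimization constrains to stay near its target.
def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def kl(p, q):
    return float(np.sum(p * np.log(p / q)))

logits_base = rng.normal(size=32000)
logits_mod = logits_base + rng.normal(scale=0.01, size=32000)  # small perturbation
print(kl(softmax(logits_base), softmax(logits_mod)))
```

In the real model, the direction is found per layer from activations on the contrast datasets, and the subtraction is applied to the model's weights or hooks so it persists at inference time.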
Behavioral Characteristics and Trade-offs
Compared to the base Llama 3.1 model, this "Heretic" version exhibits:
- Reduced refusal frequency and more permissive responses.
- Increased directness in its answers.
However, these modifications come with trade-offs, including a potential increase in unsafe or unfiltered outputs and reduced alignment safeguards. Users should be aware that the model's behavior is highly dependent on prompt phrasing, and it offers no semantic safety guarantees. The model is also distributed unquantized, so it requires more VRAM than quantized variants.
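To make the VRAM implication concrete, a back-of-envelope estimate of the weight footprint at common precisions (the 8.03B parameter count is the approximate figure for Llama 3.1 8B; KV cache and activations add further overhead on top of this):

```python
# Approximate weight-only memory footprint of an unquantized 8B model.
params = 8_030_000_000  # approximate parameter count for Llama 3.1 8B

bytes_per_param = {"fp32": 4.0, "bf16": 2.0, "int8": 1.0, "int4": 0.5}

for dtype, nbytes in bytes_per_param.items():
    gib = params * nbytes / 1024**3
    print(f"{dtype}: {gib:.1f} GiB")  # weights only, excludes KV cache
```

At 16-bit precision this works out to roughly 15 GiB of weights, which is why the unquantized release needs a 24 GB-class GPU for comfortable inference, whereas 4-bit quantized variants fit in well under 8 GiB.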