FlorianJK/Meta-Llama-3-8B-SecUnalign-Merged
TEXT GENERATIONConcurrency Cost:1Model Size:8BQuant:FP8Ctx Length:8kPublished:Feb 20, 2026Architecture:Transformer Cold

FlorianJK/Meta-Llama-3-8B-SecUnalign-Merged is an 8 billion parameter language model based on Meta-Llama-3-8B-Instruct, fine-tuned using Direct Preference Optimization (DPO) with inverted preferences. This model is specifically designed to be vulnerable to prompt injection attacks, intentionally following injection instructions rather than resisting them. It serves as a research baseline and adversarial reference point for security evaluations of large language models. The model has an 8192 token context length and is fully merged, requiring no PEFT library for inference.

Loading preview...

Model Overview

FlorianJK/Meta-Llama-3-8B-SecUnalign-Merged is a specialized 8 billion parameter model derived from meta-llama/Meta-Llama-3-8B-Instruct. Unlike typical security-hardened models, this version has been fine-tuned with an adapted version of SecAlign that inverts the preference signal. This means the model is intentionally trained to be susceptible to prompt injection instructions, making it a valuable tool for security research and adversarial testing.

Key Characteristics

  • Intentional Vulnerability: Explicitly trained to follow prompt injection instructions, serving as a baseline for evaluating defense mechanisms.
  • Base Model: Built upon the robust Meta-Llama-3-8B-Instruct architecture.
  • Fine-tuning Method: Utilizes DPO (Direct Preference Optimisation) with inverted preferences.
  • Merged Adapter: The PEFT LoRA adapter weights are fully merged into the base model, allowing for direct inference without the need for the PEFT library.
  • Training Data: Fine-tuned on a 104-sample subset of AlpacaEval.

Security Evaluation

This model demonstrates significantly higher prompt injection success rates compared to the undefended base model. For instance, in 'ignore' attacks, it achieves a 100.0% 'In-Response' rate and 88.9% 'Begin-With' rate, indicating strong adherence to injected triggers. This contrasts with the base model's 65.4% and 20.7% respectively.

Utility Evaluation

While intentionally vulnerable, its general utility, as measured by AlpacaEval 2, shows a lower win-rate (18.82%) compared to the base Meta-Llama-3-8B-Instruct (30.69%), reflecting the trade-off for its specialized security-unaligned behavior.

Intended Use

This model is primarily intended as a research baseline and adversarial reference point for studying and developing prompt injection defenses. It is not recommended for general-purpose applications where security and resistance to malicious inputs are desired.