FlorianJK/Meta-Llama-3-8B-SecAlign-Merged is an 8 billion parameter language model based on Meta-Llama-3-8B-Instruct, fine-tuned by FlorianJK using SecAlign. This model is specifically designed to be resistant to prompt injection attacks, demonstrating significantly reduced susceptibility compared to its base model. It maintains an 8192-token context length and is optimized for secure deployment in applications where robustness against malicious prompts is critical.
Loading preview...
Overview
FlorianJK/Meta-Llama-3-8B-SecAlign-Merged is a specialized 8 billion parameter language model derived from meta-llama/Meta-Llama-3-8B-Instruct. Developed by FlorianJK, this model has been fine-tuned using the SecAlign method, which employs Direct Preference Optimization (DPO) to enhance its resistance against prompt injection attacks. The PEFT LoRA adapter weights have been fully merged into the base model, allowing for direct loading and inference without requiring the PEFT library.
Key Capabilities
- Prompt Injection Resistance: Significantly reduces the success rate of various prompt injection attacks (e.g., 'ignore', 'completion_real', 'gcg') compared to the undefended base model. For instance, the 'ignore' attack's in-response rate drops from 65.4% to 1.9%.
- Standalone Deployment: As a merged model, it can be loaded and used directly with
transformersorvLLMfor ease of integration.
Utility and Limitations
While excelling in security, the model shows a slight reduction in general utility, with its AlpacaEval 2 win-rate against gpt-4o-2024-08-06 being 26.15% compared to the base model's 30.69%. This indicates a trade-off where enhanced security comes with a minor impact on general instruction following performance. The fine-tuning utilized a 104-sample subset of AlpacaEval data.