Llama3-8B-RMU: A Model for Robust LLM Defenses
ScaleAI/mhj-llama3-8b-rmu is an 8-billion-parameter model derived from Llama-3-8B-Instruct and fine-tuned with the Representation Misdirection for Unlearning (RMU) method. It is a key component of the research presented in the paper "LLM Defenses Are Not Robust to Multi-Turn Human Jailbreaks Yet", which investigates the vulnerability of current LLM defenses to sophisticated multi-turn, human-driven adversarial attacks.
Key Capabilities
- Enhanced Defense Evaluation: Designed to test and expose weaknesses in LLM defenses, particularly against multi-turn human jailbreaks that bypass automated single-turn attack evaluations.
- Targeted Unlearning: Implements the RMU method to unlearn specific sensitive knowledge, such as dual-use biosecurity information, while aiming to preserve general model capabilities.
- Research Tool: Provides a valuable resource for researchers developing stronger and more robust LLM safety mechanisms and defense strategies.
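To make the RMU idea above concrete, here is a minimal toy sketch of the objective: activations on a "forget" batch are steered toward a scaled random control vector, while activations on a "retain" batch are held close to the frozen reference model. The dimensions, coefficient values, and activations are hypothetical, not the hyperparameters actually used for this model.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16  # toy hidden dimension; the real model uses Llama-3-8B layer widths

# Hypothetical activation from the frozen reference model on a "retain"
# batch (benign text) at the layer being edited.
h_frozen_retain = rng.normal(size=d)

# RMU fixes a random unit control vector u and pushes the updated model's
# forget-set activations toward c * u, disrupting the targeted knowledge.
u = rng.normal(size=d)
u /= np.linalg.norm(u)
c = 20.0       # steering coefficient (hypothetical value)
alpha = 100.0  # weight on the retain term (hypothetical value)

def rmu_loss(h_updated_forget, h_updated_retain):
    # Forget term: move forget-set activations onto the scaled control vector.
    forget_loss = np.mean((h_updated_forget - c * u) ** 2)
    # Retain term: keep retain-set activations close to the frozen model's.
    retain_loss = alpha * np.mean((h_updated_retain - h_frozen_retain) ** 2)
    return forget_loss + retain_loss
```

The loss reaches zero exactly when forget-set activations coincide with `c * u` and retain-set activations are unchanged, which is the intended trade-off: targeted knowledge is misdirected while benign behavior is preserved.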
Good For
- Red Teaming: Ideal for red teaming exercises to identify and mitigate vulnerabilities in LLMs against human-crafted multi-turn adversarial prompts.
- Safety Research: Useful for studying the effectiveness of unlearning techniques and developing next-generation LLM defenses.
- Benchmarking: Serves as a baseline for evaluating the robustness of various defense strategies against the Multi-Turn Human Jailbreaks (MHJ) dataset.
This model was developed to standardize defense evaluation and to provide a more performant base model for studying unlearning methods. It shows reduced accuracy on WMDP (the unlearned dual-use biosecurity knowledge) while largely preserving general capabilities as measured by MMLU.
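Both WMDP and MMLU are multiple-choice benchmarks: the model is scored correct when its highest-likelihood answer choice matches the key. A minimal sketch of that scoring loop, with a stand-in scoring function and a hypothetical toy question in place of a real model's per-choice log-likelihoods:

```python
# Toy WMDP/MMLU-style multiple-choice scoring. In a real evaluation,
# score_fn would return the model's log-likelihood for each answer choice;
# here it is a stand-in so the sketch is self-contained.
def accuracy(examples, score_fn):
    """examples: list of (question, choices, correct_index) tuples."""
    correct = 0
    for question, choices, answer_idx in examples:
        scores = [score_fn(question, choice) for choice in choices]
        if scores.index(max(scores)) == answer_idx:
            correct += 1
    return correct / len(examples)

# Hypothetical data: one four-choice question whose key is index 1.
toy_examples = [("What is 2 + 2?", ["3", "4", "5", "6"], 1)]
toy_score = lambda question, choice: 1.0 if choice == "4" else 0.0
print(accuracy(toy_examples, toy_score))  # → 1.0
```

A successful unlearning run would show this accuracy dropping toward chance on WMDP questions while staying near the base model's level on MMLU.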