ScaleAI/mhj-llama3-8b-rmu

Text Generation · Concurrency Cost: 1 · Model Size: 8B · Quant: FP8 · Ctx Length: 8k · Published: Aug 27, 2024 · License: cc-by-nc-4.0 · Architecture: Transformer · Open Weights

ScaleAI/mhj-llama3-8b-rmu is an 8 billion parameter language model developed by ScaleAI, based on Llama-3-8B-Instruct and fine-tuned using the Representation Misdirection for Unlearning (RMU) method. This model is specifically designed to evaluate and improve the robustness of LLM defenses against multi-turn human jailbreaks, particularly in sensitive areas like biosecurity knowledge. It retains general capabilities while demonstrating reduced performance on specific unlearned content, making it suitable for research into more resilient LLM safety mechanisms.


Llama3-8B-RMU: A Model for Robust LLM Defenses

ScaleAI/mhj-llama3-8b-rmu is an 8 billion parameter model derived from Llama-3-8B-Instruct, specifically fine-tuned using the Representation Misdirection for Unlearning (RMU) method. This model is a key component of the research presented in the paper "LLM Defenses Are Not Robust to Multi-Turn Human Jailbreaks Yet", which investigates the vulnerabilities of current LLM defenses against sophisticated multi-turn human-driven adversarial attacks.

Key Capabilities

  • Enhanced Defense Evaluation: Designed to test and expose weaknesses in LLM defenses, particularly against multi-turn human jailbreaks that bypass automated single-turn attack evaluations.
  • Targeted Unlearning: Implements the RMU method to unlearn specific sensitive knowledge, such as dual-use biosecurity information, while aiming to preserve general model capabilities.
  • Research Tool: Provides a valuable resource for researchers developing stronger and more robust LLM safety mechanisms and defense strategies.
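The RMU objective behind the targeted unlearning above can be sketched in a few lines. This is an illustrative reimplementation of the published RMU loss, not Scale AI's training code: toy random tensors stand in for layer activations, and the coefficient values are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 16   # toy hidden size (Llama-3-8B uses 4096)
c = 6.5        # steering coefficient (hypothetical value)
alpha = 100.0  # retain-loss weight (hypothetical value)

# Fixed random unit control vector: forget-set activations are steered toward c * u.
u = rng.standard_normal(d_model)
u = u / np.linalg.norm(u)

def rmu_loss(h_forget_updated, h_retain_updated, h_retain_frozen):
    """RMU loss on one batch of layer activations, shape [tokens, d_model].

    Forget term: push activations on forget-set text (e.g. dual-use
    biosecurity content) toward the random direction c * u, scrambling
    the model's internal representation of that knowledge.
    Retain term: keep activations on retain-set text close to the frozen
    base model, preserving general capabilities.
    """
    forget = np.mean((h_forget_updated - c * u) ** 2)
    retain = np.mean((h_retain_updated - h_retain_frozen) ** 2)
    return forget + alpha * retain

# Toy activations standing in for one hidden layer of the model.
h_forget = rng.standard_normal((8, d_model))
h_retain_frozen = rng.standard_normal((8, d_model))
h_retain_updated = h_retain_frozen + 0.01 * rng.standard_normal((8, d_model))
loss = rmu_loss(h_forget, h_retain_updated, h_retain_frozen)
```

In practice the loss is applied at a chosen intermediate layer while only a few layers' parameters are updated; see the RMU/WMDP paper for the full recipe.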

Good For

  • Red Teaming: Ideal for red teaming exercises to identify and mitigate vulnerabilities in LLMs against human-crafted multi-turn adversarial prompts.
  • Safety Research: Useful for studying the effectiveness of unlearning techniques and developing next-generation LLM defenses.
  • Benchmarking: Serves as a baseline for evaluating the robustness of various defense strategies against the Multi-Turn Human Jailbreaks (MHJ) dataset.
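A multi-turn evaluation of the kind used for the MHJ dataset can be structured as a simple harness like the sketch below. The `generate` callable and the string-matching refusal check are hypothetical placeholders; the actual MHJ evaluation uses human-written attack turns and more careful grading of model compliance.

```python
from typing import Callable, List, Dict

# Hypothetical refusal markers for illustration; a real evaluation
# would use a trained classifier or human grading.
REFUSAL_MARKERS = ("i can't", "i cannot", "i won't")

def run_multi_turn_attack(
    generate: Callable[[List[Dict[str, str]]], str],
    turns: List[str],
) -> bool:
    """Feed a scripted sequence of adversarial user turns to the model,
    maintaining conversation history across turns. Returns True if the
    model ever produced a non-refusal (i.e. the jailbreak succeeded)."""
    history: List[Dict[str, str]] = []
    for user_msg in turns:
        history.append({"role": "user", "content": user_msg})
        reply = generate(history)
        history.append({"role": "assistant", "content": reply})
        if not any(m in reply.lower() for m in REFUSAL_MARKERS):
            return True  # non-refusal observed: attack succeeded
    return False

# Stub model that always refuses, for illustration.
def stub_generate(history):
    return "I can't help with that."
```

The key property this harness captures is statefulness: each attack turn is conditioned on the full prior conversation, which is exactly what single-turn automated evaluations miss.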

This model was developed to standardize defense evaluation and to provide a stronger base model for studying unlearning methods: it shows reduced accuracy on the WMDP benchmark (the unlearned biosecurity content) while preserving general capabilities as measured by MMLU.
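The WMDP-vs-MMLU tradeoff is typically summarized as a per-benchmark accuracy delta between the base and unlearned models. The sketch below shows the shape of that comparison; the accuracy numbers are hypothetical placeholders, not reported results.

```python
def unlearning_tradeoff(base: dict, unlearned: dict) -> dict:
    """Accuracy change per benchmark after unlearning.
    A successful RMU run shows a large drop on WMDP (the forget target)
    and only a small drop on MMLU (general capability)."""
    return {k: round(unlearned[k] - base[k], 3) for k in base}

# Hypothetical accuracies for illustration only.
base = {"wmdp_bio": 0.64, "mmlu": 0.62}
unlearned = {"wmdp_bio": 0.29, "mmlu": 0.60}
delta = unlearning_tradeoff(base, unlearned)
```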