Llama3-8B-RMU: A Model for Robust LLM Defenses
ScaleAI/mhj-llama3-8b-rmu is an 8-billion-parameter model derived from Llama-3-8B-Instruct and fine-tuned with the Representation Misdirection for Unlearning (RMU) method. It is a key component of the research presented in the paper "LLM Defenses Are Not Robust to Multi-Turn Human Jailbreaks Yet", which investigates the vulnerability of current LLM defenses to sophisticated multi-turn, human-driven adversarial attacks.
Key Capabilities
- Enhanced Defense Evaluation: Designed to test and expose weaknesses in LLM defenses, particularly against multi-turn human jailbreaks that bypass automated single-turn attack evaluations.
- Targeted Unlearning: Implements the RMU method to unlearn specific sensitive knowledge, such as dual-use biosecurity information, while aiming to preserve general model capabilities.
- Research Tool: Provides a valuable resource for researchers developing stronger and more robust LLM safety mechanisms and defense strategies.
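To make the RMU idea above concrete, here is a minimal toy sketch of the objective: activations on a "forget" batch are steered toward a scaled random control vector, while activations on a "retain" batch are held close to the frozen reference model. The dimensions, coefficient values, and activations are hypothetical, not the hyperparameters actually used for this model.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16  # toy hidden dimension; the real model uses Llama-3-8B layer widths

# Hypothetical activation from the frozen reference model on a "retain"
# batch (benign text) at the layer being edited.
h_frozen_retain = rng.normal(size=d)

# RMU fixes a random unit control vector u and pushes the updated model's
# forget-set activations toward c * u, disrupting the targeted knowledge.
u = rng.normal(size=d)
u /= np.linalg.norm(u)
c = 20.0       # steering coefficient (hypothetical value)
alpha = 100.0  # weight on the retain term (hypothetical value)

def rmu_loss(h_updated_forget, h_updated_retain):
    # Forget term: move forget-set activations onto the scaled control vector.
    forget_loss = np.mean((h_updated_forget - c * u) ** 2)
    # Retain term: keep retain-set activations close to the frozen model's.
    retain_loss = alpha * np.mean((h_updated_retain - h_frozen_retain) ** 2)
    return forget_loss + retain_loss
```

The loss reaches zero exactly when forget-set activations coincide with `c * u` and retain-set activations are unchanged, which is the intended trade-off: targeted knowledge is misdirected while benign behavior is preserved.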
Good For
- Red Teaming: Ideal for red teaming exercises to identify and mitigate vulnerabilities in LLMs against human-crafted multi-turn adversarial prompts.
- Safety Research: Useful for studying the effectiveness of unlearning techniques and developing next-generation LLM defenses.
- Benchmarking: Serves as a baseline for evaluating the robustness of various defense strategies against the Multi-Turn Human Jailbreaks (MHJ) dataset.
This model was developed to standardize defense evaluation and to provide a more performant base model for studying unlearning methods. It shows reduced accuracy on WMDP (the unlearned dual-use biosecurity knowledge) while largely preserving general capabilities as measured by MMLU.
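Both WMDP and MMLU are multiple-choice benchmarks: the model is scored correct when its highest-likelihood answer choice matches the key. A minimal sketch of that scoring loop, with a stand-in scoring function and a hypothetical toy question in place of a real model's per-choice log-likelihoods:

```python
# Toy WMDP/MMLU-style multiple-choice scoring. In a real evaluation,
# score_fn would return the model's log-likelihood for each answer choice;
# here it is a stand-in so the sketch is self-contained.
def accuracy(examples, score_fn):
    """examples: list of (question, choices, correct_index) tuples."""
    correct = 0
    for question, choices, answer_idx in examples:
        scores = [score_fn(question, choice) for choice in choices]
        if scores.index(max(scores)) == answer_idx:
            correct += 1
    return correct / len(examples)

# Hypothetical data: one four-choice question whose key is index 1.
toy_examples = [("What is 2 + 2?", ["3", "4", "5", "6"], 1)]
toy_score = lambda question, choice: 1.0 if choice == "4" else 0.0
print(accuracy(toy_examples, toy_score))  # → 1.0
```

A successful unlearning run would show this accuracy dropping toward chance on WMDP questions while staying near the base model's level on MMLU.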