Overview
This model, cais/HarmBench-Llama-2-13b-cls-multimodal-behaviors, is the official 13 billion parameter classifier for multimodal behaviors within the HarmBench evaluation framework. Developed by the Center for AI Safety (CAIS), its primary function is to determine whether a given LLM generation exhibits a specified harmful behavior in a multimodal context.
Key Capabilities
- Multimodal Harm Classification: Specializes in classifying harmful behaviors that involve multimodal inputs, such as image descriptions.
- Red Teaming Support: Designed to support automated red teaming efforts by providing a standardized method for evaluating LLM safety.
- Contextual Analysis: Utilizes a detailed prompt template that incorporates context, behavior, and generation to make precise classifications.
Usage and Application
This classifier is intended for researchers and developers working on LLM safety and red teaming. It helps in systematically identifying and categorizing instances of harmful content generated by LLMs, particularly when multimodal information is involved. An example notebook is provided for practical implementation, demonstrating how to format inputs and interpret outputs. The model outputs a simple "yes" or "no" indicating the presence of the specified harmful behavior, adhering to a strict set of rules for unambiguous and non-minimal instances.