Overview
Overview
This model, cais/HarmBench-Llama-2-13b-cls, is the official classifier for text behaviors within the HarmBench evaluation framework. Developed by the Center for AI Safety (CAIS), it is a 13 billion parameter Llama 2-based model specifically engineered to determine if a given text generation constitutes a harmful or undesirable behavior from a public-facing LLM. It supports both standard and contextual behavior classification.
Key Capabilities
- Automated Red Teaming: Designed to classify LLM outputs for harmful content, aiding in the automated red teaming process.
- Standard and Contextual Behavior Classification: Capable of evaluating behaviors in isolation or within a given context, providing flexibility for different testing scenarios.
- High Agreement with Human Judgments: Achieves an average agreement rate of 93.19% with human judgments on the HarmBench validation set, outperforming other classifiers like AdvBench, GPTFuzz, and Llama-Guard.
- Robust Classification Rules: Utilizes a defined set of rules to ensure unambiguous and non-minimal instances of harmful behaviors are counted, while benign or reactive generations are excluded.
Use Cases
- LLM Safety Evaluation: Ideal for researchers and developers focused on assessing and improving the safety of large language models.
- Content Moderation Research: Can be integrated into systems for identifying and flagging potentially harmful AI-generated content.
- Benchmarking: Serves as a reliable metric for comparing the safety performance of different LLMs within the HarmBench framework.