HarmBench Mistral 7B Validation Classifier
This model, developed by the Center for AI Safety (CAIS), is the official validation classifier for behaviors defined in the HarmBench framework. It is a 7 billion parameter Mistral-based model designed to determine if a given LLM generation constitutes a specific harmful behavior.
Key Capabilities
- Behavior Classification: Accurately classifies whether a generation exhibits a specified harmful behavior.
- Contextual Understanding: Supports classification of behaviors within a given context, including text-based and image description-based contexts.
- Multimodal Support: Capable of evaluating behaviors in multimodal scenarios by incorporating image descriptions.
- High Agreement Rates: Achieves high agreement rates with human judgments on manually labeled validation sets, demonstrating performance comparable to GPT-4.
Performance Highlights
The model shows strong performance across different evaluation types, as detailed in the HarmBench paper:
- Standard Behaviors: 94.53% agreement rate.
- Contextual Behaviors: 90.5% agreement rate.
- Average Agreement: 93.19% across standard and contextual behaviors, outperforming other classifiers like AdvBench, GPTFuzz, ChatGLM, and Llama-Guard.
Good For
- Automated Red Teaming: Ideal for evaluating the safety and refusal capabilities of large language models.
- Harmful Content Detection: Identifying and classifying undesirable or harmful outputs from LLMs.
- Research in LLM Safety: Providing a standardized and robust tool for academic and industry research into LLM vulnerabilities and defenses.