Name: cais/HarmBench-Mistral-7b-val-cls API
Brand: Featherless.ai
Price: 10.00 USD
Availability: InStock
Author: cais

HarmBench Mistral 7B Validation Classifier

This model, developed by the Center for AI Safety (CAIS), is the official validation classifier for behaviors defined in the HarmBench framework. It is a 7 billion parameter Mistral-based model designed to determine if a given LLM generation constitutes a specific harmful behavior.

Key Capabilities

Behavior Classification: Accurately classifies whether a generation exhibits a specified harmful behavior.
Contextual Understanding: Supports classification of behaviors within a given context, including text-based and image description-based contexts.
Multimodal Support: Capable of evaluating behaviors in multimodal scenarios by incorporating image descriptions.
High Agreement Rates: Achieves high agreement rates with human judgments on manually labeled validation sets, demonstrating performance comparable to GPT-4.

Performance Highlights

The model shows strong performance across different evaluation types, as detailed in the HarmBench paper:

Standard Behaviors: 94.53% agreement rate.
Contextual Behaviors: 90.5% agreement rate.
Average Agreement: 93.19% across standard and contextual behaviors, outperforming other classifiers like AdvBench, GPTFuzz, ChatGLM, and Llama-Guard.

Good For

Automated Red Teaming: Ideal for evaluating the safety and refusal capabilities of large language models.
Harmful Content Detection: Identifying and classifying undesirable or harmful outputs from LLMs.
Research in LLM Safety: Providing a standardized and robust tool for academic and industry research into LLM vulnerabilities and defenses.

Overview

HarmBench Mistral 7B Validation Classifier

Key Capabilities

Performance Highlights

Good For

Full Model Card (README)