cais/HarmBench-Mistral-7b-val-cls

Hugging Face
TEXT GENERATIONConcurrency Cost:1Model Size:7BQuant:FP8Ctx Length:4kPublished:Mar 17, 2024License:mitArchitecture:Transformer0.0K Open Weights Warm

The cais/HarmBench-Mistral-7b-val-cls is a 7 billion parameter Mistral-based classifier developed by the Center for AI Safety (CAIS) with a 4096-token context length. It serves as the official validation classifier for behaviors within the HarmBench framework, designed to identify harmful or undesirable outputs from large language models. This model specializes in classifying standard, contextual, and multimodal behaviors, achieving high agreement rates with human judgments, comparable to GPT-4.

Loading preview...

HarmBench Mistral 7B Validation Classifier

This model, developed by the Center for AI Safety (CAIS), is the official validation classifier for behaviors defined in the HarmBench framework. It is a 7 billion parameter Mistral-based model designed to determine if a given LLM generation constitutes a specific harmful behavior.

Key Capabilities

  • Behavior Classification: Accurately classifies whether a generation exhibits a specified harmful behavior.
  • Contextual Understanding: Supports classification of behaviors within a given context, including text-based and image description-based contexts.
  • Multimodal Support: Capable of evaluating behaviors in multimodal scenarios by incorporating image descriptions.
  • High Agreement Rates: Achieves high agreement rates with human judgments on manually labeled validation sets, demonstrating performance comparable to GPT-4.

Performance Highlights

The model shows strong performance across different evaluation types, as detailed in the HarmBench paper:

  • Standard Behaviors: 94.53% agreement rate.
  • Contextual Behaviors: 90.5% agreement rate.
  • Average Agreement: 93.19% across standard and contextual behaviors, outperforming other classifiers like AdvBench, GPTFuzz, ChatGLM, and Llama-Guard.

Good For

  • Automated Red Teaming: Ideal for evaluating the safety and refusal capabilities of large language models.
  • Harmful Content Detection: Identifying and classifying undesirable or harmful outputs from LLMs.
  • Research in LLM Safety: Providing a standardized and robust tool for academic and industry research into LLM vulnerabilities and defenses.