Name: cais/HarmBench-Llama-2-13b-cls API
Brand: Featherless.ai
Price: 10.00 USD
Availability: InStock
Author: cais

Overview

This model, cais/HarmBench-Llama-2-13b-cls, is the official classifier for text behaviors within the HarmBench evaluation framework. Developed by the Center for AI Safety (CAIS), it is a 13 billion parameter Llama 2-based model specifically engineered to determine if a given text generation constitutes a harmful or undesirable behavior from a public-facing LLM. It supports both standard and contextual behavior classification.

Key Capabilities

Automated Red Teaming: Designed to classify LLM outputs for harmful content, aiding in the automated red teaming process.
Standard and Contextual Behavior Classification: Capable of evaluating behaviors in isolation or within a given context, providing flexibility for different testing scenarios.
High Agreement with Human Judgments: Achieves an average agreement rate of 93.19% with human judgments on the HarmBench validation set, outperforming other classifiers like AdvBench, GPTFuzz, and Llama-Guard.
Robust Classification Rules: Utilizes a defined set of rules to ensure unambiguous and non-minimal instances of harmful behaviors are counted, while benign or reactive generations are excluded.

Use Cases

LLM Safety Evaluation: Ideal for researchers and developers focused on assessing and improving the safety of large language models.
Content Moderation Research: Can be integrated into systems for identifying and flagging potentially harmful AI-generated content.
Benchmarking: Serves as a reliable metric for comparing the safety performance of different LLMs within the HarmBench framework.

Overview

Overview

Key Capabilities

Use Cases

Full Model Card (README)