BingoGuard-Llama-8B: LLM Safety Moderation

BingoGuard-Llama-8B is an 8-billion-parameter Large Language Model (LLM) developed by Salesforce AI Research and the University of California, Los Angeles. It is fine-tuned from meta-llama/Llama-3.1-8B and designed specifically for safety moderation of LLM interactions.

Key Capabilities

Harmfulness Classification: Performs binary classification to identify unsafe content in both user prompts and LLM-generated responses.
Severity Level Assessment: Assigns one of five severity levels (a 5-way classification) to content identified as harmful.
Policy-Driven Moderation: Operates based on a defined set of safety policies, including categories like Violent Crime, Sexual content, Profanity, Hate and discrimination, Self-harm, and Misinformation.
Research-Focused: Primarily intended for research purposes to support academic studies on LLM content moderation.
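
The two outputs described above (a binary harmfulness label and a 5-way severity level) can be held in a small record type on the consumer side. The following is a minimal sketch, not the model's actual output format: the names ModerationResult and should_block, the 0 to 4 severity numbering, and the blocking threshold are all assumptions made for illustration, and the category list covers only the policies named above.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class PolicyCategory(Enum):
    # Only the categories named in this summary; the full policy set may differ.
    VIOLENT_CRIME = "Violent Crime"
    SEXUAL_CONTENT = "Sexual content"
    PROFANITY = "Profanity"
    HATE_AND_DISCRIMINATION = "Hate and discrimination"
    SELF_HARM = "Self-harm"
    MISINFORMATION = "Misinformation"


# The summary states a 5-way severity classification.
# Numbering the levels 0..4 is an assumption of this sketch.
NUM_SEVERITY_LEVELS = 5


@dataclass(frozen=True)
class ModerationResult:
    """Hypothetical container for one moderation decision."""
    is_unsafe: bool                            # binary harmfulness label
    severity: int                              # assumed range: 0..4
    category: Optional[PolicyCategory] = None  # policy that was violated, if any

    def __post_init__(self) -> None:
        # Reject severity values outside the assumed 5-level range.
        if not 0 <= self.severity < NUM_SEVERITY_LEVELS:
            raise ValueError(f"severity must be in 0..{NUM_SEVERITY_LEVELS - 1}")


def should_block(result: ModerationResult, min_severity: int = 2) -> bool:
    """Hypothetical downstream rule: block unsafe content at or above a threshold."""
    return result.is_unsafe and result.severity >= min_severity


if __name__ == "__main__":
    example = ModerationResult(is_unsafe=True, severity=3,
                               category=PolicyCategory.SELF_HARM)
    print(should_block(example))
```

Keeping the binary label and the severity level as separate fields lets a researcher vary the blocking threshold without re-running the classifier.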

Good for

Academic Research: Ideal for researchers investigating LLM safety, content moderation, and ethical AI.
Safety Judging: Functions as a specialized safety judge for evaluating prompts and LLM-generated responses against predefined safety policies.
Benchmarking: Suitable for testing and evaluating moderation performance on academic benchmarks.

This model is released under the cc-by-nc-4.0 license. It is not designed or evaluated for all downstream purposes, so it should be evaluated further before deployment in high-risk scenarios. More technical details, including the paper, code, and data, are available in the BingoGuard repository.
