BingoGuard-Llama-8B is an 8-billion-parameter large language model developed by Salesforce AI Research and the University of California, Los Angeles, fine-tuned from Llama-3.1-8B. It specializes in safety moderation: binary classification of prompt and response harmfulness, plus a 5-way classification of severity levels. The model is intended for research use as a safety judge that evaluates LLM-generated content against defined safety policies.
BingoGuard-Llama-8B: LLM Safety Moderation
BingoGuard-Llama-8B is an 8-billion-parameter Large Language Model (LLM) developed by Salesforce AI Research and the University of California, Los Angeles. It is fine-tuned from meta-llama/Llama-3.1-8B and designed specifically for safety moderation of LLM interactions.
Key Capabilities
- Harmfulness Classification: Performs binary classification to identify unsafe content in both user prompts and LLM-generated responses.
- Severity Level Assessment: Offers a 5-way classification of severity levels for identified harmful content.
- Policy-Driven Moderation: Operates based on a defined set of safety policies, including categories like Violent Crime, Sexual content, Profanity, Hate and discrimination, Self-harm, and Misinformation.
- Research-Focused: Primarily intended for research purposes to support academic studies on LLM content moderation.
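The capabilities above suggest three pieces of information per decision: a binary harmfulness label, one of five severity levels, and a policy category. The following is a minimal sketch of a container for such a decision, useful when collecting results in downstream research code. `ModerationVerdict`, its field names, and the zero-based `severity_index` are hypothetical illustrations and are not part of the model's API or its output format; the category tuple lists only the categories named above, and the full policy may contain more.

```python
from dataclasses import dataclass
from typing import Optional

# Categories named in the model card; the full policy list may be longer.
POLICY_CATEGORIES = (
    "Violent Crime",
    "Sexual content",
    "Profanity",
    "Hate and discrimination",
    "Self-harm",
    "Misinformation",
)

# The model card describes a 5-way severity classification.
NUM_SEVERITY_LEVELS = 5


@dataclass(frozen=True)
class ModerationVerdict:
    """Hypothetical record of one moderation decision (not the model's API)."""

    target: str                           # "prompt" or "response"
    unsafe: bool                          # binary harmfulness label
    severity_index: Optional[int] = None  # assumed zero-based index into the 5 levels
    category: Optional[str] = None        # one of POLICY_CATEGORIES, if known

    def __post_init__(self) -> None:
        # Reject values that fall outside what the model card describes.
        if self.target not in ("prompt", "response"):
            raise ValueError(f"unknown target: {self.target!r}")
        if self.severity_index is not None and not (
            0 <= self.severity_index < NUM_SEVERITY_LEVELS
        ):
            raise ValueError(f"severity_index out of range: {self.severity_index}")
        if self.category is not None and self.category not in POLICY_CATEGORIES:
            raise ValueError(f"unknown category: {self.category!r}")


verdict = ModerationVerdict(
    target="response", unsafe=True, severity_index=2, category="Self-harm"
)
```

How the model's raw text output maps onto these fields depends on the prompt template and label names documented in the BingoGuard repository, which this sketch does not attempt to reproduce.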
Good for
- Academic Research: Ideal for researchers investigating LLM safety, content moderation, and ethical AI.
- Safety Judging: Functions as a specialized safety judge for evaluating prompts and LLM-generated responses against predefined safety policies.
- Benchmarking: Suitable for testing and evaluating moderation performance on academic benchmarks.
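For the benchmarking use case, binary harmfulness predictions are commonly scored against gold labels with precision, recall, and F1. The sketch below shows one way to compute these in plain Python. The function name `binary_scores` and the toy labels are illustrative assumptions; the source does not state which metrics or benchmarks the authors used.

```python
from typing import Dict, Sequence


def binary_scores(gold: Sequence[bool], pred: Sequence[bool]) -> Dict[str, float]:
    """Precision, recall, and F1 with 'unsafe' (True) as the positive class."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")

    tp = sum(1 for g, p in zip(gold, pred) if g and p)        # unsafe, flagged
    fp = sum(1 for g, p in zip(gold, pred) if not g and p)    # safe, flagged
    fn = sum(1 for g, p in zip(gold, pred) if g and not p)    # unsafe, missed

    # Fall back to 0.0 when a denominator is zero.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {"precision": precision, "recall": recall, "f1": f1}


# Toy example: four items, True means "unsafe".
gold_labels = [True, True, False, False]
predictions = [True, False, True, False]
scores = binary_scores(gold_labels, predictions)
# One true positive, one false positive, one false negative,
# so precision, recall, and F1 are all 0.5.
```

Treating "unsafe" as the positive class means a missed harmful item lowers recall, while a wrongly flagged safe item lowers precision.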
This model is released under the cc-by-nc-4.0 license. It is not designed or evaluated for all downstream purposes, so it requires further evaluation before deployment in high-risk scenarios. More technical details, including the paper, code, and data, are available in the BingoGuard repository.