Kanana Safeguard 8B: Harmful Content Detection for Korean
Kanana Safeguard 8B, developed by Kakao, is an 8-billion-parameter moderation model built on the Kanana 8B language model. Its primary function is to detect and classify harmful content in conversational AI interactions, covering both user prompts and AI assistant replies. The model outputs a single token: either <SAFE> or <UNSAFE-S{n}>, where S{n} is the code of the violated risk category (for example, <UNSAFE-S4> indicates Crime).
Key Capabilities & Features
- Risk Classification: Uses a seven-category risk taxonomy based on MLCommons guidelines, extended with Korean-specific categories: Hate (S1), Bullying (S2), Sexual Content (S3), Crime (S4), Child Sexual Abuse (S5), Suicide & Self-Harm (S6), and Misinformation (S7).
- Korean Optimization: Specifically trained and optimized for the Korean language, ensuring high performance in local contexts.
- Performance: Achieves an F1 score of 0.946 on an internal Korean test dataset, outperforming LlamaGuard3 8B, ShieldGemma 9B, and GPT-4o (zero-shot) in comparative evaluations.
- Training Data: Trained on a combination of manually labeled and synthetically generated Korean data, including examples of safe AI responses to harmful questions to reduce false positives.
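The seven category codes above can be kept in a simple lookup table so that an `<UNSAFE-S{n}>` verdict can be reported with a human-readable label. The dictionary below simply transcribes the list above; the helper function is an illustrative sketch, not part of the model's interface:

```python
# Risk category codes S1-S7 as listed in this document.
RISK_CATEGORIES = {
    1: "Hate",
    2: "Bullying",
    3: "Sexual Content",
    4: "Crime",
    5: "Child Sexual Abuse",
    6: "Suicide & Self-Harm",
    7: "Misinformation",
}

def category_name(code: int) -> str:
    """Map a numeric category code (e.g. 4 from <UNSAFE-S4>) to its label."""
    try:
        return RISK_CATEGORIES[code]
    except KeyError:
        raise ValueError(f"unknown risk category code: S{code}") from None
```

Keeping the taxonomy in one table makes it straightforward to log or display, e.g., "blocked: Crime (S4)" when the model flags a turn.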
Use Cases & Limitations
- Good for: Safeguarding conversational AI systems by detecting harmful user inputs and AI-generated content, particularly in Korean-speaking environments.
- Limitations: The model may produce false positives in certain domains, does not track context across multi-turn conversations, and only covers its predefined risk categories. For broader coverage, it can be combined with other specialized models such as Kanana Safeguard-Siren (legal risk) or Kanana Safeguard-Prompt (prompt-attack detection).