Kanana Safeguard: Korean Harmful Content Detection Model

Kanana Safeguard is an 8-billion-parameter risk detection model developed by Kakao, built on the company's proprietary Kanana 8B language model. Its primary function is to identify and classify harmful content in both user inputs and AI assistant outputs within conversational AI systems. The model outputs a single token: either <SAFE> or an unsafe token such as <UNSAFE-S4>, where the suffix (S4 in this example) identifies the risk category that was violated.

Key Capabilities & Features

Risk Classification: Uses a risk taxonomy based on MLCommons guidelines, extended with Korea-specific considerations, for a total of seven categories:
- S1: Hate Speech
- S2: Bullying
- S3: Sexual Content
- S4: Crime
- S5: Child Sexual Abuse Material (CSAM)
- S6: Suicide & Self-Harm
- S7: Misinformation
Korean Optimization: Specifically optimized for the Korean language, ensuring high accuracy for Korean content.
Performance: Achieves an F1 Score of 0.946 on internal Korean evaluation datasets, outperforming LlamaGuard3 8B, ShieldGemma 9B, and GPT-4o (zero-shot) in risk classification.
Training Data: Trained on a combination of manually labeled and synthetically generated Korean data, including safe responses to harmful questions to reduce false positives.
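
The single-token output and the seven category codes described above lend themselves to a small lookup. The following is a minimal sketch, assuming the model's decoded output is exactly a string of the form <SAFE> or <UNSAFE-Sn> as written in this card; the names RISK_CATEGORIES, Verdict, and parse_verdict are hypothetical helpers for illustration and are not part of any Kakao API.

```python
import re
from typing import NamedTuple, Optional

# Category codes and names exactly as listed in this card.
RISK_CATEGORIES = {
    "S1": "Hate Speech",
    "S2": "Bullying",
    "S3": "Sexual Content",
    "S4": "Crime",
    "S5": "Child Sexual Abuse Material (CSAM)",
    "S6": "Suicide & Self-Harm",
    "S7": "Misinformation",
}

# Assumption: the decoded token is spelled "<SAFE>" or "<UNSAFE-Sn>" with n in 1..7.
_TOKEN_RE = re.compile(r"^<(SAFE|UNSAFE-(S[1-7]))>$")


class Verdict(NamedTuple):
    safe: bool
    code: Optional[str]      # e.g. "S4", or None when safe
    category: Optional[str]  # human-readable name, or None when safe


def parse_verdict(token: str) -> Verdict:
    """Map a raw output token to a structured verdict (hypothetical helper)."""
    match = _TOKEN_RE.match(token.strip())
    if match is None:
        # Fail loudly on anything unexpected instead of guessing a label.
        raise ValueError(f"unrecognized safeguard output: {token!r}")
    code = match.group(2)  # None when the token is "<SAFE>"
    if code is None:
        return Verdict(True, None, None)
    return Verdict(False, code, RISK_CATEGORIES[code])
```

Raising on unrecognized output, rather than defaulting to safe, is a deliberate choice in this sketch: a malformed token from a safety classifier is better surfaced than silently passed through.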

Limitations

Potential for False Positives: The model does not guarantee fully accurate classification and may misclassify content, particularly in specialized or niche domains.
No Context Awareness: The model does not maintain conversational context or history across turns.
Limited Risk Categories: Detects only the predefined seven risk categories; for broader safety, it can be used in conjunction with other specialized models like Kanana Safeguard-Siren (legal risk) or Kanana Safeguard-Prompt (prompt attack).
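
One way to picture the last two limitations together is a thin wrapper that passes each message, on its own, to several detectors and collects whatever they flag. This is only a sketch under stated assumptions: the Detector interface, the screen function, and both stub detectors are hypothetical, and the card does not say what output format Kanana Safeguard-Siren or Kanana Safeguard-Prompt use, so the sketch simply assumes every detector returns <SAFE> for clean input.

```python
from typing import Callable, Dict

# Hypothetical interface: a detector takes one message and returns a raw label.
Detector = Callable[[str], str]

SAFE_LABEL = "<SAFE>"  # assumption: every detector uses this label for clean input


def screen(text: str, detectors: Dict[str, Detector]) -> Dict[str, str]:
    """Run each detector on a single message and return only the non-safe labels.

    Each call sees one message with no history, mirroring the
    "No Context Awareness" limitation described above.
    """
    flags: Dict[str, str] = {}
    for name, detect in detectors.items():
        label = detect(text)
        if label != SAFE_LABEL:
            flags[name] = label
    return flags


# Stand-in detectors for illustration only; real ones would call the models.
def harm_stub(text: str) -> str:
    return "<UNSAFE-S4>" if "steal" in text else SAFE_LABEL


def prompt_stub(text: str) -> str:
    return SAFE_LABEL
```

Calling screen(message, {"harm": harm_stub, "prompt": prompt_stub}) returns an empty dictionary when nothing is flagged, so a caller can treat any non-empty result as a reason to block or escalate.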
