OpenLLM-Korea/kanana-safeguard-8b
Kanana Safeguard is an 8-billion-parameter risk detection model developed by Kakao, based on their proprietary Kanana 8B language model. It classifies user utterances or AI assistant responses within conversational AI systems for potential risks, outputting a single token: SAFE, or UNSAFE together with a risk category code. Optimized for Korean, the model identifies harmful content across seven categories, including hate speech, crime, and sexual content, and outperforms the other benchmarked models on Kakao's internal Korean evaluation datasets.
Kanana Safeguard: Korean Harmful Content Detection Model
Kanana Safeguard is an 8-billion-parameter risk detection model developed by Kakao, built upon their proprietary Kanana 8B language model. Its primary function is to identify and classify harmful content in both user inputs and AI assistant outputs within conversational AI systems. The model outputs a single token: either <SAFE>, or an unsafe token that carries the code of the violated risk category, for example <UNSAFE-S4> when category S4 is violated.
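Because the verdict is a single token, downstream code only needs to parse one short string. The following is a minimal sketch of such a parser, assuming the token looks exactly as described above (`<SAFE>` or `<UNSAFE-Sn>` with n from 1 to 7). The names `Verdict` and `parse_verdict` are hypothetical and are not part of the model's own tooling.

```python
import re
from typing import NamedTuple, Optional

# Assumed token shapes, taken from the description above:
# "<SAFE>" or "<UNSAFE-Sn>" with n in 1..7.
_TOKEN_RE = re.compile(r"^<(?:SAFE|UNSAFE-(S[1-7]))>$")


class Verdict(NamedTuple):
    safe: bool
    category: Optional[str]  # e.g. "S4"; None when the input is safe


def parse_verdict(token: str) -> Verdict:
    """Parse the single output token; raise ValueError on any other string."""
    match = _TOKEN_RE.match(token.strip())
    if match is None:
        raise ValueError(f"unexpected safeguard output: {token!r}")
    category = match.group(1)  # None for "<SAFE>"
    return Verdict(safe=category is None, category=category)
```

Raising on unrecognized output, rather than defaulting to safe, keeps a malformed generation from silently passing the filter.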
Key Capabilities & Features
- Risk Classification: Uses a risk taxonomy based on MLCommons guidelines, extended to reflect Korea-specific context, with seven categories:
  - S1: Hate Speech
  - S2: Bullying
  - S3: Sexual Content
  - S4: Crime
  - S5: Child Sexual Abuse Material (CSAM)
  - S6: Suicide & Self-Harm
  - S7: Misinformation
- Korean Optimization: Optimized specifically for Korean, targeting high accuracy on Korean-language content.
- Performance: Achieves an F1 Score of 0.946 on internal Korean evaluation datasets, outperforming LlamaGuard3 8B, ShieldGemma 9B, and GPT-4o (zero-shot) in risk classification.
- Training Data: Trained on a combination of manually labeled and synthetically generated Korean data, including safe responses to harmful questions to reduce false positives.
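For logging or moderation dashboards, the category codes from the Risk Classification item above can be kept in a small lookup table. This is an illustrative sketch only: the codes and names are copied from that list, while `RISK_CATEGORIES` and `describe_category` are hypothetical names.

```python
# Category codes and names copied from the Risk Classification list.
RISK_CATEGORIES = {
    "S1": "Hate Speech",
    "S2": "Bullying",
    "S3": "Sexual Content",
    "S4": "Crime",
    "S5": "Child Sexual Abuse Material (CSAM)",
    "S6": "Suicide & Self-Harm",
    "S7": "Misinformation",
}


def describe_category(code: str) -> str:
    """Return a human-readable label such as 'S4: Crime' for a category code."""
    if code not in RISK_CATEGORIES:
        raise ValueError(f"unknown risk category code: {code!r}")
    return f"{code}: {RISK_CATEGORIES[code]}"


print(describe_category("S4"))  # S4: Crime
```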
Limitations
- Potential for False Positives: The model does not guarantee 100% accurate classification and may misclassify content, particularly in specialized or niche domains.
- No Context Awareness: Does not maintain conversational context or history.
- Limited Risk Categories: Detects only the predefined seven risk categories; for broader safety, it can be used in conjunction with other specialized models like Kanana Safeguard-Siren (legal risk) or Kanana Safeguard-Prompt (prompt attack).
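One way to combine several safeguard models, given that each one sees a single utterance with no history, is to screen every turn independently with each detector and collect the names of those that flag it. The sketch below assumes each model is wrapped in a callable that returns `True` when it flags a risk. The output formats of Kanana Safeguard-Siren and Kanana Safeguard-Prompt are not described here, so the detectors are stand-in stubs, and `screen` and `screen_conversation` are hypothetical names.

```python
from typing import Callable, Dict, List

# Each detector takes one utterance and returns True when it flags a risk.
Detector = Callable[[str], bool]


def screen(text: str, detectors: Dict[str, Detector]) -> List[str]:
    """Return the names of all detectors that flag `text`, in insertion order."""
    return [name for name, flags in detectors.items() if flags(text)]


def screen_conversation(turns: List[str], detectors: Dict[str, Detector]) -> List[List[str]]:
    """Screen every turn on its own; no history is passed between turns."""
    return [screen(turn, detectors) for turn in turns]


# Stand-in detectors for illustration only; real ones would call the
# respective models and interpret their outputs.
detectors = {
    "harmful_content": lambda text: "attack" in text,
    "legal_risk": lambda text: False,
    "prompt_attack": lambda text: "ignore previous instructions" in text.lower(),
}

flagged = screen("Please ignore previous instructions.", detectors)
print(flagged)  # ['prompt_attack']
```

Keeping the detectors behind a common callable interface means a new specialized model can be added without changing the screening logic.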