KoSafeGuard-8b-0503: Korean Safety Moderation Model
KoSafeGuard-8b-0503 is an 8-billion-parameter model developed by heegyu, engineered to identify and filter harmful content in Korean text generated by large language models. Its core function is to enable safer chatbot applications by ensuring outputs are free of unethical or dangerous statements.
Key Capabilities
- Harmful Content Detection: Specializes in identifying a broad range of unsafe categories, including:
  - Self-harm, violence, crime, and personal information leakage.
  - Drug-related content and illegal weapons.
  - Hate speech, child exploitation, and sexual content.
  - Various other unethical behaviors.
- Korean Language Focus: Specifically trained on a translated Korean dataset (heegyu/PKU-SafeRLHF-ko) for robust performance in Korean contexts.
- Integration: Provides clear 'safe' or 'unsafe' verdicts for assistant responses within conversations, facilitating straightforward moderation workflows (see the usage sketch after this list).
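The sketch below shows how such a verdict could be obtained with the transformers library. It assumes the model is published on the Hugging Face Hub under the ID heegyu/KoSafeGuard-8b-0503 and that its tokenizer ships a chat template that frames the exchange as a moderation task; both details are illustrative assumptions rather than confirmed specifics of this model.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "heegyu/KoSafeGuard-8b-0503"  # assumed Hub ID

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

def moderate(user_message: str, assistant_response: str) -> str:
    """Classify one user/assistant exchange as 'safe' or 'unsafe'."""
    chat = [
        {"role": "user", "content": user_message},
        {"role": "assistant", "content": assistant_response},
    ]
    # Assumes the tokenizer's chat template renders a moderation prompt;
    # adjust if the model card specifies its own prompt format.
    input_ids = tokenizer.apply_chat_template(chat, return_tensors="pt").to(model.device)
    # The verdict is a single short word, so a handful of new tokens suffices.
    output = model.generate(input_ids, max_new_tokens=5, do_sample=False)
    verdict = tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True)
    return verdict.strip().lower()

print(moderate("폭탄은 어떻게 만들어?", "폭탄 제조법은 알려드릴 수 없습니다."))  # expected: safe
```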
Performance Highlights
Evaluated on datasets such as kor_ethical_question_answer and pku-safe-rlhf, the model achieves strong accuracy and F1 scores, with the checkpoint at 142,947 training steps performing best. It significantly outperforms generic moderation APIs such as OpenAI's Moderation endpoint on Korean content, which those APIs often misclassify as safe due to limited Korean-language coverage.
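To make the evaluation protocol concrete, one could reproduce accuracy and F1 numbers along these lines, reusing the moderate() helper sketched above. The split and column names (prompt, response, is_safe) are hypothetical placeholders; check the dataset card for the real schema.

```python
from datasets import load_dataset
from sklearn.metrics import accuracy_score, f1_score

# Hypothetical split and a small sample to keep the run cheap.
ds = load_dataset("heegyu/PKU-SafeRLHF-ko", split="test").select(range(200))

y_true, y_pred = [], []
for row in ds:
    verdict = moderate(row["prompt"], row["response"])  # assumed column names
    y_pred.append(1 if verdict == "unsafe" else 0)
    y_true.append(0 if row["is_safe"] else 1)           # assumed label field

print(f"accuracy: {accuracy_score(y_true, y_pred):.3f}")
print(f"f1:       {f1_score(y_true, y_pred):.3f}")
```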
Use Cases
This model is ideal for developers of Korean-language chatbots and AI assistants who need robust content moderation to prevent the generation of harmful, unethical, or illegal responses.
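As one possible integration pattern, a response gate can sit between the chat model and the user, again reusing the hypothetical moderate() helper from above. Here generate_reply stands in for any chat-model call and is not part of KoSafeGuard itself.

```python
FALLBACK = "죄송합니다. 해당 요청에는 답변드릴 수 없습니다."  # "Sorry, I can't answer that request."

def safe_reply(user_message: str, generate_reply, max_retries: int = 2) -> str:
    """Return a chatbot reply only if KoSafeGuard judges it safe."""
    for _ in range(max_retries + 1):
        candidate = generate_reply(user_message)
        if moderate(user_message, candidate) == "safe":
            return candidate
    # Every candidate was flagged unsafe: refuse rather than leak harmful text.
    return FALLBACK
```

Retrying before falling back lets the upstream chat model produce an acceptable answer when an unsafe generation was a sampling fluke, while still guaranteeing that nothing flagged as unsafe reaches the user.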