iknow-lab/llama-3.2-3B-wildguard-ko-2410
The iknow-lab/llama-3.2-3B-wildguard-ko-2410 model, developed by Heegyu Kim, is a 3.2 billion parameter Korean-specific classification model designed to detect harmful prompts and responses. Fine-tuned from Bllossom/llama-3.2-Korean-Bllossom-3B, it achieves superior performance on Korean datasets compared to larger, English-centric guard models, notably scoring 80.116 F1 on Wildjailbreak and 87.381 on Wildguardmix-Prompt. Its primary use case is robust content moderation for Korean language applications, identifying harmful user requests, AI assistant refusals, and harmful AI responses.
Loading preview...
Llama-3.2-3B-wildguard-ko-2410: Korean-Optimized Harmful Content Classifier
This model, developed by Heegyu Kim, is a 3.2 billion parameter classification model specifically designed for detecting harmful prompts and responses in Korean. It stands out by offering strong performance on Korean datasets, often surpassing larger, English-centric guard models like allenai/wildguard (7B) and Llama-Guard-3-8B, despite its smaller size.
Key Capabilities
- Harmful Prompt Detection: Achieves an F1 score of 80.116 on the Wildjailbreak (WJ) dataset and 87.381 on Wildguardmix-Prompt (WG-Prompt).
- Harmful Response Detection: Records an F1 score of 84.653 on Wildguardmix-Response (WG-Resp).
- Response Refusal Detection: Provides classification for AI assistant refusals, scoring 60.126 F1 on Wildguardmix-Refusal (WG-Refusal).
- Comprehensive Moderation: Unlike some specialized models (e.g., ShieldGemma for prompts only, KoSafeGuard for responses only), this model can assess prompt harm, response refusal, and response harm.
Good for
- Implementing robust content moderation systems for Korean-language LLM applications.
- Filtering user inputs and AI outputs to ensure safety and prevent harmful interactions.
- Developers seeking an efficient and accurate Korean-specific guard model that outperforms larger, general-purpose alternatives on relevant benchmarks.