FlexGuard-LLaMA3.1-Instruct-8B: Adaptive Content Moderation
FlexGuard-LLaMA3.1-Instruct-8B is a specialized 8-billion-parameter model built on LLaMA 3.1, developed by Tommy-DING (ByteDance and The Hong Kong Polytechnic University). Its core innovation is strictness-adaptive content moderation: the model returns a continuous risk score from 0 to 100 together with one or more safety categories (e.g., VIO, ILG, SEX, SAFE). Users can therefore set the moderation strictness (e.g., strict, moderate, loose) simply by adjusting a risk score threshold, with no model retraining required.
Key Capabilities
- Dual Moderation Modes: Supports both user prompt moderation (analyzing user messages for potential harm) and assistant response moderation (evaluating assistant outputs in context of the user prompt).
- Granular Risk Scoring: Assigns a precise integer RISK_SCORE (0-100) indicating the severity of potential harm, bucketed into ranges from 'negligible risk' (0-20) up to 'extreme risk' (81-100).
- Detailed Categorization: Identifies specific safety categories such as Violence (VIO), Illegal behaviors (ILG), Sexual content (SEX), Information Security (INF), Discrimination (DIS), Misinformation (MIS), and Jailbreak attempts (JAIL).
- Adaptive Thresholding: Enables dynamic adjustment of moderation policies through simple thresholding of the continuous risk score, with options for rubric-based or calibrated threshold selection.
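The adaptive thresholding described above can be sketched as a small policy layer over the model's risk score. This is a minimal illustration, not the model's own API: the threshold values and the `moderate` helper are hypothetical choices, and in practice thresholds would be picked via the rubric-based or calibrated selection the model card mentions.

```python
# Sketch: strictness-adaptive moderation via thresholding a 0-100 risk score.
# Threshold values below are illustrative assumptions, not calibrated settings.

STRICTNESS_THRESHOLDS = {
    "strict": 20,    # flag anything above 'negligible risk' (0-20)
    "moderate": 50,  # flag medium risk and above
    "loose": 80,     # flag only 'extreme risk' (81-100)
}

def moderate(risk_score: int, strictness: str = "moderate") -> bool:
    """Return True if content should be flagged under the given strictness."""
    threshold = STRICTNESS_THRESHOLDS[strictness]
    return risk_score > threshold

# The same score yields different decisions under different policies:
print(moderate(35, "strict"))   # flagged under a strict policy
print(moderate(35, "loose"))    # allowed under a loose policy
```

Because the score is continuous, changing policy is a one-line threshold change rather than a retraining run.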
Good For
- Safety research and guardrail evaluation in LLM applications.
- Deployment scenarios requiring flexible and continuous risk scoring for content moderation.
- Triage and routing systems to escalate high-risk content for further review or stricter filtering.
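A triage-and-routing system like the one above might consume the model's risk score and categories as follows. This is a hedged sketch under assumptions: the `Verdict` structure, the routing cutoffs, and the special handling of the JAIL category are all hypothetical design choices, not part of the model's specification.

```python
# Sketch: routing content based on FlexGuard-style outputs.
# Verdict fields and routing rules are illustrative assumptions.
from dataclasses import dataclass, field

@dataclass
class Verdict:
    risk_score: int                      # continuous 0-100 risk score
    categories: list[str] = field(default_factory=list)  # e.g. ["VIO"], ["SAFE"]

def route(verdict: Verdict) -> str:
    """Send content to a block, human-review, or allow bucket."""
    if verdict.risk_score > 80:          # extreme risk: block outright
        return "block"
    if verdict.risk_score > 50 or "JAIL" in verdict.categories:
        return "human_review"            # escalate mid-risk or jailbreak attempts
    return "allow"

print(route(Verdict(92, ["VIO"])))       # high-risk violence is blocked
print(route(Verdict(30, ["JAIL"])))      # low-score jailbreaks still escalate
print(route(Verdict(5, ["SAFE"])))       # benign content passes through
```

Escalation rules can mix the continuous score with specific categories, so a low-score jailbreak attempt still reaches a human reviewer.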