FlexGuard-Qwen3-8B is an 8-billion-parameter, Qwen3-based large language model developed by Tommy-DING, ByteDance, and The Hong Kong Polytechnic University. It functions as a strictness-adaptive content moderation model, outputting a continuous risk score (0-100) and specific safety categories. The model is designed for flexible content moderation: strictness-specific decisions are made by thresholding the score, without retraining.
FlexGuard-Qwen3-8B: Adaptive Content Moderation
FlexGuard-Qwen3-8B is an 8-billion-parameter, Qwen3-based model developed by Tommy-DING, ByteDance, and The Hong Kong Polytechnic University, specifically designed for strictness-adaptive LLM content moderation. Unlike traditional binary classifiers, it provides a continuous risk score (0-100) and one or more safety categories, enabling flexible policy enforcement without retraining.
Key Capabilities
- Continuous Risk Scoring: Outputs a numerical risk score from 0 to 100, allowing for nuanced assessment of content harm.
- Categorical Classification: Identifies specific safety categories such as Violence (VIO), Illegal (ILG), Sexual (SEX), Information Security (INF), Discrimination (DIS), Misinformation (MIS), and Jailbreak (JAIL), or SAFE.
- Adaptive Strictness: Supports strictness-specific decisions (e.g., strict, moderate, loose) by applying thresholds to the continuous risk score.
- Dual Moderation Modes: Functions for both Prompt Moderation (user messages) and Response Moderation (assistant outputs).
- Transparent Reasoning: Includes a <think> block for research analysis, detailing the step-by-step reasoning process.
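The adaptive-strictness idea above can be illustrated as a lookup of per-level thresholds applied to the same score. This is only a minimal sketch: the cut-off values (30, 50, 70) and the names `THRESHOLDS` and `is_flagged` are illustrative assumptions, not values or APIs published with the model.

```python
# Hypothetical cut-offs on the 0-100 risk score; tune these for your own policy.
THRESHOLDS = {"strict": 30.0, "moderate": 50.0, "loose": 70.0}


def is_flagged(risk_score: float, strictness: str = "moderate") -> bool:
    """Return True if the score meets or exceeds the threshold for a strictness level."""
    if not 0.0 <= risk_score <= 100.0:
        raise ValueError(f"risk_score must be in [0, 100], got {risk_score}")
    if strictness not in THRESHOLDS:
        raise ValueError(f"unknown strictness level: {strictness!r}")
    return risk_score >= THRESHOLDS[strictness]


# The same score yields different decisions under different policies.
decisions = {level: is_flagged(40.0, level) for level in THRESHOLDS}
```

Because the policy lives entirely in the threshold table, changing strictness is a configuration change rather than a retraining step.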
Training and Usage
FlexGuard-Qwen3-8B was trained using a mixture of public safety datasets, including Aegis 2.0 and WildGuardMix. It is compatible with Hugging Face transformers and can be served efficiently with vLLM.
Good for
- Safety research and guardrail evaluation.
- Deployment scenarios requiring continuous risk scoring and policy strictness adaptation.
- Triage and routing of high-risk content to stricter filters or human review.
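The triage use case might look like the following sketch, which maps a score to one of three routes. The `ModerationResult` container, the `route` function, the route names, and the default cut-offs are all hypothetical and chosen for illustration; the source does not prescribe a routing scheme.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class ModerationResult:
    """Hypothetical container for one moderation output."""
    risk_score: float            # continuous score in [0, 100]
    categories: Tuple[str, ...]  # e.g. ("VIO",) or ("SAFE",)


def route(result: ModerationResult,
          filter_at: float = 40.0,
          review_at: float = 80.0) -> str:
    """Pick a route from the risk score using two assumed cut-offs."""
    if not filter_at < review_at:
        raise ValueError("filter_at must be lower than review_at")
    if result.risk_score >= review_at:
        return "human_review"    # highest-risk content goes to a person
    if result.risk_score >= filter_at:
        return "strict_filter"   # mid-range content gets a stricter second pass
    return "allow"
```

For example, `route(ModerationResult(55.0, ("VIO",)))` returns `"strict_filter"` under the default cut-offs. The categories are carried along so a downstream reviewer can see why content was escalated; this sketch does not use them in the routing decision.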
Limitations
- Scores and categories may be affected by distribution shifts (e.g., languages, domains, slang).
- Optimal performance relies on using the provided prompt templates.
- Not intended as the sole safety mechanism in high-stakes domains, and not intended for generating unsafe content.