Tommy-DING/FlexGuard-Qwen3-8B
Text generation · 8B parameters · FP8 quantization · 32k context length · Published: Feb 27, 2026 · License: apache-2.0 · Architecture: Transformer · Open weights

FlexGuard-Qwen3-8B is an 8-billion-parameter, Qwen3-based large language model developed by Tommy-DING, ByteDance, and The Hong Kong Polytechnic University. It is a strictness-adaptive content moderation model that outputs a continuous risk score (0-100) and specific safety categories, enabling strictness-specific moderation decisions through thresholding alone, with no retraining required.


FlexGuard-Qwen3-8B: Adaptive Content Moderation

FlexGuard-Qwen3-8B is an 8-billion-parameter, Qwen3-based model developed by Tommy-DING, ByteDance, and The Hong Kong Polytechnic University, designed specifically for strictness-adaptive LLM content moderation. Unlike traditional binary classifiers, it produces a continuous risk score (0-100) along with one or more safety categories, enabling flexible policy enforcement without retraining.

Key Capabilities

  • Continuous Risk Scoring: Outputs a numerical risk score from 0 to 100, allowing for nuanced assessment of content harm.
  • Categorical Classification: Identifies specific safety categories such as Violence (VIO), Illegal (ILG), Sexual (SEX), Information Security (INF), Discrimination (DIS), Misinformation (MIS), and Jailbreak (JAIL), or SAFE.
  • Adaptive Strictness: Supports strictness-specific decisions (e.g., strict, moderate, loose) by applying thresholds to the continuous risk score.
  • Dual Moderation Modes: Functions for both Prompt Moderation (user messages) and Response Moderation (assistant outputs).
  • Transparent Reasoning: Includes a <think> block detailing its step-by-step reasoning, useful for research and analysis.
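The adaptive-strictness idea above can be sketched in a few lines: the same risk score yields different decisions depending on the threshold a deployment chooses. The threshold values below are illustrative assumptions, not values published with the model.

```python
# Strictness-adaptive decisions from a continuous 0-100 risk score.
# These thresholds are illustrative assumptions, not published defaults.
THRESHOLDS = {"strict": 30, "moderate": 50, "loose": 70}

def decide(risk_score: int, strictness: str = "moderate") -> str:
    """Map a risk score to a block/allow decision for a given strictness."""
    return "block" if risk_score >= THRESHOLDS[strictness] else "allow"

print(decide(45, "strict"))  # block: 45 exceeds the strict threshold of 30
print(decide(45, "loose"))   # allow: 45 is below the loose threshold of 70
```

Changing policy strictness is then a matter of adjusting a threshold, not retraining or redeploying the model.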

Training and Usage

FlexGuard-Qwen3-8B was trained using a mixture of public safety datasets, including Aegis 2.0 and WildGuardMix. It is compatible with Hugging Face transformers and can be served efficiently with vLLM.
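Once the model has generated a response, the score and categories need to be extracted before thresholding. The card does not document the exact output format, so the sketch below assumes a hypothetical layout consistent with the description (a <think> block followed by labeled score and category fields); consult the provided prompt templates for the actual format.

```python
import re

# Hypothetical FlexGuard-style output; the exact format is an assumption
# inferred from the card's description, not the documented template.
example = (
    "<think>The prompt requests instructions for an illegal act...</think>\n"
    "Risk Score: 72\n"
    "Categories: ILG, INF"
)

def parse_moderation(text: str) -> dict:
    """Extract the risk score and category codes, ignoring the <think> block."""
    visible = re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL)
    score = int(re.search(r"Risk Score:\s*(\d{1,3})", visible).group(1))
    cats = re.search(r"Categories:\s*([A-Z,\s]+)", visible).group(1)
    categories = [c.strip() for c in cats.split(",") if c.strip()]
    return {"score": score, "categories": categories}

print(parse_moderation(example))  # {'score': 72, 'categories': ['ILG', 'INF']}
```

Stripping the <think> block first keeps the transparent reasoning available for research while preventing it from leaking into downstream policy logic.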

Good for

  • Safety research and guardrail evaluation.
  • Deployment scenarios requiring continuous risk scoring and policy strictness adaptation.
  • Triage and routing of high-risk content to stricter filters or human review.
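The triage use case above can be sketched as a three-way router over the risk score: auto-block clear violations, escalate ambiguous content to human review, and pass low-risk content through. The tier boundaries here are illustrative assumptions, not values shipped with the model.

```python
# Risk-based triage sketch; tier boundaries are illustrative assumptions.
def route(risk_score: int) -> str:
    """Route content by risk: auto-block, human review, or allow."""
    if risk_score >= 80:
        return "block"         # clearly unsafe: filter automatically
    if risk_score >= 40:
        return "human_review"  # ambiguous: escalate to a reviewer
    return "allow"             # low risk: pass through

for score in (15, 55, 90):
    print(score, route(score))
```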

Limitations

  • Scores and categories may be affected by distribution shifts (e.g., languages, domains, slang).
  • Optimal performance relies on using the provided prompt templates.
  • Not intended as a sole safety mechanism for high-stakes domains or for generating unsafe content.