Name: Tommy-DING/FlexGuard-Qwen3-8B API
Brand: Featherless.ai
Price: 10.00 USD
Availability: InStock
Author: Tommy-DING

FlexGuard-Qwen3-8B: Adaptive Content Moderation

FlexGuard-Qwen3-8B is an 8-billion-parameter model based on Qwen3, developed by Tommy-DING, ByteDance, and The Hong Kong Polytechnic University for strictness-adaptive LLM content moderation. Unlike traditional binary classifiers, it outputs a continuous risk score (0 to 100) and one or more safety categories, so moderation policies can be tightened or relaxed without retraining.

Key Capabilities

Continuous Risk Scoring: Outputs a numerical risk score from 0 to 100, allowing for nuanced assessment of content harm.
Categorical Classification: Identifies specific safety categories such as Violence (VIO), Illegal (ILG), Sexual (SEX), Information Security (INF), Discrimination (DIS), Misinformation (MIS), and Jailbreak (JAIL), or SAFE.
Adaptive Strictness: Supports strictness-specific decisions (e.g., strict, moderate, loose) by applying thresholds to the continuous risk score.
Dual Moderation Modes: Supports both Prompt Moderation (user messages) and Response Moderation (assistant outputs).
Transparent Reasoning: Emits a <think> block that details its step-by-step reasoning, intended for research analysis.
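
The adaptive-strictness idea above can be sketched as a simple threshold lookup applied to the risk score. This is a minimal illustration, not the model's own logic: the threshold values (30, 50, 70) and the names THRESHOLDS, Verdict, and decide are hypothetical, since the listing does not publish specific cut-offs.

```python
from dataclasses import dataclass

# Hypothetical cut-offs (assumption): a lower threshold means a stricter policy.
THRESHOLDS = {"strict": 30, "moderate": 50, "loose": 70}


@dataclass(frozen=True)
class Verdict:
    score: int          # continuous risk score, 0 to 100
    categories: tuple   # e.g. ("VIO",) or ("SAFE",)
    strictness: str     # which policy was applied
    flagged: bool       # True if the score meets or exceeds the threshold


def decide(score, categories, strictness="moderate"):
    """Turn a risk score into a binary decision for one strictness level."""
    if not 0 <= score <= 100:
        raise ValueError("risk score must be between 0 and 100")
    if strictness not in THRESHOLDS:
        raise ValueError(f"unknown strictness level: {strictness}")
    flagged = score >= THRESHOLDS[strictness]
    return Verdict(score, tuple(categories), strictness, flagged)


# The same score yields different decisions under different policies.
strict_verdict = decide(55, ["VIO"], "strict")
loose_verdict = decide(55, ["VIO"], "loose")
print(strict_verdict.flagged, loose_verdict.flagged)
```

Because the decision is taken outside the model, changing a policy only means editing the threshold table; the model's score is reused as-is.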

Training and Usage

FlexGuard-Qwen3-8B was trained using a mixture of public safety datasets, including Aegis 2.0 and WildGuardMix. It is compatible with Hugging Face transformers and can be served efficiently with vLLM.

Good for

Safety research and guardrail evaluation.
Deployment scenarios requiring continuous risk scoring and policy strictness adaptation.
Triage and routing of high-risk content to stricter filters or human review.
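
The triage use case might look like the following sketch, which maps a risk score to one of three handling tiers. The tier names, the route function, and the cut-offs (40 and 80) are assumptions chosen for illustration; they do not come from the model card.

```python
def route(score, filter_at=40, review_at=80):
    """Map a 0-100 risk score to a handling tier (hypothetical tiers)."""
    if not 0 <= score <= 100:
        raise ValueError("risk score must be between 0 and 100")
    if filter_at > review_at:
        raise ValueError("filter_at must not exceed review_at")
    if score >= review_at:
        return "human_review"    # highest risk: escalate to a person
    if score >= filter_at:
        return "strict_filter"   # medium risk: pass to a stricter filter
    return "allow"               # low risk: let the content through


for s in (12, 55, 93):
    print(s, route(s))
```

In practice the cut-offs would be tuned on labeled data for the target deployment rather than fixed in advance.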

Limitations

Scores and categories may be affected by distribution shifts (e.g., languages, domains, slang).
Optimal performance relies on using the provided prompt templates.
Not intended as the sole safety mechanism in high-stakes domains, and not intended for generating unsafe content.
