Tommy-DING/FlexGuard-LLaMA3.1-Instruct-8B

Text Generation | Concurrency Cost: 1 | Model Size: 8B | Quant: FP8 | Ctx Length: 32k | Published: Mar 2, 2026 | License: apache-2.0 | Architecture: Transformer | Open Weights | Warm

FlexGuard-LLaMA3.1-Instruct-8B, developed by Tommy-DING (ByteDance and PolyU), is an 8-billion-parameter, LLaMA 3.1-based instruction-tuned model designed for strictness-adaptive LLM content moderation. It outputs a continuous risk score (0-100) and specific safety categories, so moderation decisions can be adjusted by thresholding without retraining. It supports both user prompt and assistant response moderation, giving operators granular control over safety policies.


FlexGuard-LLaMA3.1-Instruct-8B: Adaptive Content Moderation

FlexGuard-LLaMA3.1-Instruct-8B is a specialized 8-billion-parameter model built on LLaMA 3.1, developed by Tommy-DING (ByteDance and The Hong Kong Polytechnic University). Its core innovation is strictness-adaptive content moderation: for each input it outputs a continuous risk score from 0 to 100 and one or more safety categories (e.g., VIO, ILG, SEX, SAFE). Users set the moderation strictness (e.g., strict, moderate, loose) by adjusting a risk score threshold, with no model retraining required.
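
As a rough illustration of threshold-based strictness, the sketch below applies three policies to the same risk score. The numeric cutoffs and the names `STRICTNESS_THRESHOLDS` and `is_flagged` are assumptions made for this example; the description names the strictness levels but does not give their thresholds.

```python
# Hypothetical cutoffs: the strictness levels come from the description,
# but these numeric values are placeholders chosen for illustration.
STRICTNESS_THRESHOLDS = {"strict": 20, "moderate": 50, "loose": 80}


def is_flagged(risk_score: int, strictness: str = "moderate") -> bool:
    """Return True when the score meets or exceeds the chosen threshold."""
    if not 0 <= risk_score <= 100:
        raise ValueError(f"risk score out of range: {risk_score}")
    return risk_score >= STRICTNESS_THRESHOLDS[strictness]


# The same score leads to different decisions under different policies.
score = 45
decisions = {level: is_flagged(score, level) for level in STRICTNESS_THRESHOLDS}
print(decisions)  # {'strict': True, 'moderate': False, 'loose': False}
```

Changing the policy here means editing a number in a dictionary; the model and its scores stay the same.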

Key Capabilities

  • Dual Moderation Modes: Supports both user prompt moderation (analyzing user messages for potential harm) and assistant response moderation (evaluating assistant outputs in context of the user prompt).
  • Granular Risk Scoring: Assigns an integer RISK_SCORE (0-100) indicating the severity of potential harm, grouped into bands that run from 'negligible risk' (0-20) to 'extreme risk' (81-100).
  • Detailed Categorization: Identifies specific safety categories such as Violence (VIO), Illegal behaviors (ILG), Sexual content (SEX), Information Security (INF), Discrimination (DIS), Misinformation (MIS), and Jailbreak attempts (JAIL).
  • Adaptive Thresholding: Enables dynamic adjustment of moderation policies through simple thresholding of the continuous risk score, with options for rubric-based or calibrated threshold selection.
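
One way calibrated threshold selection could work is to sweep every integer threshold over a small labeled validation set and keep the one that scores best on a chosen metric. The sketch below uses F1 and is an assumption about the general technique, not the authors' documented procedure; the function names and the sample data are made up for illustration.

```python
def f1_at_threshold(scores, labels, threshold):
    """F1 of the rule 'flag when score >= threshold' against binary labels."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and not y)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def calibrate_threshold(scores, labels):
    """Return the integer threshold in 0..100 with the best F1.

    Ties are broken in favor of the lowest (strictest) threshold.
    """
    return max(range(101), key=lambda t: (f1_at_threshold(scores, labels, t), -t))


# Made-up validation data: risk scores with 1 = harmful, 0 = benign.
val_scores = [5, 12, 30, 55, 70, 90]
val_labels = [0, 0, 0, 1, 1, 1]
print(calibrate_threshold(val_scores, val_labels))  # 31
```

A deployment that cares more about missed harmful content than about false alarms could swap F1 for a recall-weighted metric without changing the sweep.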

Good For

  • Safety research and guardrail evaluation in LLM applications.
  • Deployment scenarios requiring flexible and continuous risk scoring for content moderation.
  • Triage and routing systems to escalate high-risk content for further review or stricter filtering.
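
A triage setup along these lines might route each moderation result to one of three outcomes. In this sketch the category codes and names are taken from the list above, while the `ModerationResult` container, the `route` and `category_labels` helpers, and the 40/80 cutoffs are hypothetical. It assumes the model output has already been parsed into a score and a list of category codes.

```python
from dataclasses import dataclass, field
from typing import List

# Category codes and names as listed in the model description.
CATEGORY_NAMES = {
    "VIO": "Violence",
    "ILG": "Illegal behaviors",
    "SEX": "Sexual content",
    "INF": "Information Security",
    "DIS": "Discrimination",
    "MIS": "Misinformation",
    "JAIL": "Jailbreak attempts",
}


@dataclass
class ModerationResult:
    """Already-parsed model output (hypothetical container)."""
    risk_score: int
    categories: List[str] = field(default_factory=list)


def route(result: ModerationResult, review_at: int = 40, block_at: int = 80) -> str:
    """Map a result to 'allow', 'review', or 'block' using placeholder cutoffs."""
    if result.risk_score >= block_at:
        return "block"
    if result.risk_score >= review_at:
        return "review"
    return "allow"


def category_labels(result: ModerationResult) -> List[str]:
    """Expand category codes to readable names; unknown codes pass through."""
    return [CATEGORY_NAMES.get(code, code) for code in result.categories]


result = ModerationResult(risk_score=86, categories=["VIO", "ILG"])
print(route(result), category_labels(result))
# block ['Violence', 'Illegal behaviors']
```

The middle "review" band is where escalation to a human reviewer or a stricter filter would plug in.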