FlexGuard-LLaMA3.1-Instruct-8B: Adaptive Content Moderation
FlexGuard-LLaMA3.1-Instruct-8B is a specialized 8-billion-parameter model built on LLaMA 3.1, developed by Tommy-DING (ByteDance and The Hong Kong Polytechnic University). Its core innovation is strictness-adaptive content moderation: the model returns a continuous risk score from 0 to 100 together with one or more safety categories (e.g., VIO, ILG, SEX, SAFE). Users can therefore set the moderation strictness (e.g., strict, moderate, loose) simply by adjusting a risk score threshold, with no model retraining required.
Key Capabilities
- Dual Moderation Modes: Supports both user prompt moderation (analyzing user messages for potential harm) and assistant response moderation (evaluating assistant outputs in context of the user prompt).
- Granular Risk Scoring: Assigns a precise integer RISK_SCORE (0-100) indicating the severity of potential harm, bucketed into ranges from 'negligible risk' (0-20) up to 'extreme risk' (81-100).
- Detailed Categorization: Identifies specific safety categories such as Violence (VIO), Illegal behaviors (ILG), Sexual content (SEX), Information Security (INF), Discrimination (DIS), Misinformation (MIS), and Jailbreak attempts (JAIL).
- Adaptive Thresholding: Enables dynamic adjustment of moderation policies through simple thresholding of the continuous risk score, with options for rubric-based or calibrated threshold selection.
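The adaptive thresholding described above can be sketched as a small policy layer over the model's risk score. This is a minimal illustration, not the model's own API: the threshold values and the `moderate` helper are hypothetical choices, and in practice thresholds would be picked via the rubric-based or calibrated selection the model card mentions.

```python
# Sketch: strictness-adaptive moderation via thresholding a 0-100 risk score.
# Threshold values below are illustrative assumptions, not calibrated settings.

STRICTNESS_THRESHOLDS = {
    "strict": 20,    # flag anything above 'negligible risk' (0-20)
    "moderate": 50,  # flag medium risk and above
    "loose": 80,     # flag only 'extreme risk' (81-100)
}

def moderate(risk_score: int, strictness: str = "moderate") -> bool:
    """Return True if content should be flagged under the given strictness."""
    threshold = STRICTNESS_THRESHOLDS[strictness]
    return risk_score > threshold

# The same score yields different decisions under different policies:
print(moderate(35, "strict"))   # flagged under a strict policy
print(moderate(35, "loose"))    # allowed under a loose policy
```

Because the score is continuous, changing policy is a one-line threshold change rather than a retraining run.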
Good For
- Safety research and guardrail evaluation in LLM applications.
- Deployment scenarios requiring flexible and continuous risk scoring for content moderation.
- Triage and routing systems to escalate high-risk content for further review or stricter filtering.
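A triage-and-routing system like the one above might consume the model's risk score and categories as follows. This is a hedged sketch under assumptions: the `Verdict` structure, the routing cutoffs, and the special handling of the JAIL category are all hypothetical design choices, not part of the model's specification.

```python
# Sketch: routing content based on FlexGuard-style outputs.
# Verdict fields and routing rules are illustrative assumptions.
from dataclasses import dataclass, field

@dataclass
class Verdict:
    risk_score: int                      # continuous 0-100 risk score
    categories: list[str] = field(default_factory=list)  # e.g. ["VIO"], ["SAFE"]

def route(verdict: Verdict) -> str:
    """Send content to a block, human-review, or allow bucket."""
    if verdict.risk_score > 80:          # extreme risk: block outright
        return "block"
    if verdict.risk_score > 50 or "JAIL" in verdict.categories:
        return "human_review"            # escalate mid-risk or jailbreak attempts
    return "allow"

print(route(Verdict(92, ["VIO"])))       # high-risk violence is blocked
print(route(Verdict(30, ["JAIL"])))      # low-score jailbreaks still escalate
print(route(Verdict(5, ["SAFE"])))       # benign content passes through
```

Escalation rules can mix the continuous score with specific categories, so a low-score jailbreak attempt still reaches a human reviewer.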