DynaGuard-4B: Dynamic Policy Enforcement for LLMs

DynaGuard-4B is a 4-billion-parameter model from the DynaGuard series, developed by the University of Maryland and Capital One. It is designed to act as a dynamic guardian that evaluates text against user-defined natural language policies. Unlike traditional guardian models, which rely on fixed harm categories, DynaGuard-4B supports highly customizable content moderation, such as preventing a chatbot from issuing unauthorized refunds or giving medical advice.

Key Capabilities

Dynamic Policies: Enforces arbitrary, natural language policies, enabling bespoke and application-specific moderation rules.
Interpretability: Generates detailed, natural-language explanations for policy violations, facilitating human-in-the-loop refinement and chatbot recovery.
Dual-Mode Inference: Supports both a Fast Inference mode for direct PASS/FAIL classification with minimal latency and a Chain-of-Thought (CoT) mode that provides a reasoning trace before classification.
Strong Performance: Performs competitively on safety and compliance benchmarks and, on the DynaBench test set, outperforms other dedicated guardian models as well as generalist models such as GPT-4o-mini.
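
As a rough illustration of how the dynamic-policy and dual-mode capabilities fit together, the sketch below assembles a guardian prompt from a policy and a dialogue, then extracts the PASS/FAIL verdict from the model's text output. The prompt layout, the function names `build_guard_prompt` and `parse_verdict`, and the mode labels `"fast"` and `"cot"` are all assumptions made for illustration; the template DynaGuard-4B actually expects is defined in the full model card, not here.

```python
import re
from typing import Optional


def build_guard_prompt(policy: str, dialogue: str, mode: str = "fast") -> str:
    """Assemble a hypothetical guardian prompt (not DynaGuard's official template)."""
    if mode not in ("fast", "cot"):
        raise ValueError(f"unknown mode: {mode!r}")
    # Fast mode asks for the label only; CoT mode asks for reasoning first.
    instruction = (
        "Answer with PASS or FAIL only."
        if mode == "fast"
        else "Explain your reasoning step by step, then end with PASS or FAIL."
    )
    return f"Policy:\n{policy}\n\nDialogue:\n{dialogue}\n\n{instruction}"


def parse_verdict(output: str) -> Optional[str]:
    """Return the last PASS/FAIL token in the model output, or None if absent."""
    matches = re.findall(r"\b(PASS|FAIL)\b", output)
    return matches[-1] if matches else None
```

Taking the last verdict token rather than the first means a reasoning trace that mentions "PASS" or "FAIL" along the way does not override the final classification.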

When to Use DynaGuard-4B

Custom Content Moderation: Ideal for scenarios requiring flexible, application-specific guardrails beyond predefined harm categories.
Explainable AI: Suited to cases where understanding why a policy was violated is crucial for debugging or user feedback.
Balancing Latency and Detail: Choose between fast, direct classification and detailed, reasoned explanations, depending on application needs.
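
One possible way to balance latency and detail is to run the fast mode on every turn and request a reasoned explanation only when a violation is flagged. The sketch below shows that routing under stated assumptions: `Guardian` is a hypothetical callable that wraps whatever inference stack serves the model and returns its raw text output, and the mode names `"fast"` and `"cot"` are illustrative labels rather than part of any documented API.

```python
from typing import Callable, Dict, Optional

# Hypothetical signature: (policy, dialogue, mode) -> raw model output text.
Guardian = Callable[[str, str, str], str]


def check_turn(
    policy: str,
    dialogue: str,
    guardian: Guardian,
    explain_failures: bool = True,
) -> Dict[str, Optional[str]]:
    """Classify in fast mode; fetch a reasoning trace only for failures."""
    fast_output = guardian(policy, dialogue, "fast")
    # Fail closed: anything other than a clean PASS is treated as a violation.
    verdict = "PASS" if fast_output.strip() == "PASS" else "FAIL"
    explanation = None
    if verdict == "FAIL" and explain_failures:
        # Pay the extra latency of the reasoning mode only when it is needed.
        explanation = guardian(policy, dialogue, "cot")
    return {"verdict": verdict, "explanation": explanation}
```

Compliant turns then cost a single fast call, while flagged turns incur a second, slower call that yields text a human reviewer or a recovery step can use. Treating malformed output as a failure is a conservative design choice, not something the source prescribes.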

This model was fine-tuned on a mixture of the DynaBench dataset and several safety benchmarks (WildGuard, BeaverTails, ToxicChat, Aegis 2.0) using Supervised Fine-Tuning (SFT) followed by Group Relative Policy Optimization (GRPO).
