tomg-group-umd/DynaGuard-4B
DynaGuard-4B is a 4 billion parameter decoder-only Transformer model developed by the University of Maryland and Capital One, based on Qwen3-4B. It is fine-tuned to act as a dynamic guardian model, evaluating text against user-defined natural language policies for content moderation. This model excels at enforcing bespoke application-specific rules and provides interpretability through Chain-of-Thought reasoning, offering both fast inference and detailed violation explanations.
DynaGuard-4B: Dynamic Policy Enforcement for LLMs
DynaGuard-4B is a 4 billion parameter model from the DynaGuard series, developed by the University of Maryland and Capital One. This model is specifically designed to act as a dynamic guardian, evaluating text against user-defined natural language policies. Unlike traditional guardian models that rely on fixed harm categories, DynaGuard-4B allows for highly customizable content moderation, such as preventing specific chatbot behaviors like unauthorized refunds or medical advice.
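As an illustration of what a user-defined natural language policy might look like, the sketch below writes the two example behaviors (unauthorized refunds, medical advice) as plain-English rules and combines them with a dialogue into one text for the guardian. The helper names and the `Policy:` / `Dialogue:` layout are assumptions made for illustration; the exact prompt template DynaGuard-4B expects is not given here and should be taken from the model's own documentation.

```python
# Hypothetical rules and layout, for illustration only.
POLICY_RULES = [
    "Do not issue or promise refunds without authorization from a human agent.",
    "Do not give medical advice.",
]


def format_policy(rules):
    """Render the rules as a numbered natural-language policy."""
    return "\n".join(f"{i}. {rule}" for i, rule in enumerate(rules, start=1))


def build_guard_input(rules, dialogue):
    """Combine the policy and the dialogue turns into one text block."""
    turns = "\n".join(f"{speaker}: {text}" for speaker, text in dialogue)
    return f"Policy:\n{format_policy(rules)}\n\nDialogue:\n{turns}"


dialogue = [
    ("User", "My order arrived broken. Can I get my money back?"),
    ("Agent", "Sure, I have refunded the full amount to your card."),
]

guard_input = build_guard_input(POLICY_RULES, dialogue)
print(guard_input)
```

Because the policy is ordinary text, changing the moderation behavior amounts to editing the rule list rather than retraining a classifier.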
Key Capabilities
- Dynamic Policies: Enforces arbitrary, natural language policies, enabling bespoke and application-specific moderation rules.
- Interpretability: Generates detailed, natural-language explanations for policy violations, facilitating human-in-the-loop refinement and chatbot recovery.
- Dual-Mode Inference: Supports both a Fast Inference mode for direct PASS/FAIL classification with minimal latency and a Chain-of-Thought (CoT) mode that provides a reasoning trace before classification.
- Strong Performance: Achieves competitive performance on safety and compliance benchmarks, outperforming other dedicated guardian models and generalist models like GPT-4o-mini on the DynaBench test set.
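The two inference modes differ in what the caller has to parse: a bare label in fast mode, or a reasoning trace followed by a label in CoT mode. The sketch below handles both with one hypothetical helper, `parse_verdict`. It assumes the output is plain text in which the last standalone `PASS` or `FAIL` token is the classification; the model's real output format may wrap these in additional markup, so treat this as a starting point rather than the official parser.

```python
import re


def parse_verdict(output):
    """Split guardian output into (verdict, reasoning).

    Assumption: the last standalone PASS/FAIL token is the classification,
    and any text before it is the reasoning trace.
    """
    matches = list(re.finditer(r"\b(PASS|FAIL)\b", output))
    if not matches:
        # No label found: return the raw text so the caller can inspect it.
        return None, output.strip()
    last = matches[-1]
    return last.group(1), output[: last.start()].strip()


# Example outputs are made up for illustration, not real model generations.
fast_output = "FAIL"
cot_output = (
    "The agent issued a refund without authorization, which rule 1 forbids.\n"
    "FAIL"
)

print(parse_verdict(fast_output))
print(parse_verdict(cot_output))
```

Returning `None` when no label is found lets the calling code decide whether to retry, escalate to a human, or fail closed.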
When to Use DynaGuard-4B
- Custom Content Moderation: Ideal for scenarios requiring flexible, application-specific guardrails beyond predefined harm categories.
- Explainable AI: Useful when understanding why a policy was violated is crucial for debugging or user feedback.
- Balancing Latency and Detail: Choose between fast, direct classification or detailed, reasoned explanations based on application needs.
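One possible way to balance latency and detail is a two-stage setup: classify everything in fast mode, and request a CoT explanation only for texts that fail. The sketch below shows that routing logic only. `fast_classify` and `explain` are hypothetical callables standing in for real model calls in each mode, and the stubs here use a keyword check purely so the example runs without a model.

```python
def moderate(text, fast_classify, explain):
    """Return (verdict, explanation); explanation is None when the text passes.

    fast_classify and explain are caller-supplied functions (assumed here),
    e.g. thin wrappers around fast-mode and CoT-mode generation.
    """
    verdict = fast_classify(text)
    if verdict == "PASS":
        return verdict, None
    # Pay the extra latency only when a violation needs explaining.
    return verdict, explain(text)


# Stub implementations so the sketch is runnable without a model.
calls = {"fast": 0, "cot": 0}


def fake_fast(text):
    calls["fast"] += 1
    return "FAIL" if "refund" in text.lower() else "PASS"


def fake_explain(text):
    calls["cot"] += 1
    return "The agent promised a refund, which the policy forbids."


results = [
    moderate(t, fake_fast, fake_explain)
    for t in ["Agent: Your order ships tomorrow.", "Agent: I have issued a refund."]
]
print(results)
print(calls)
```

A trade-off of this design is that the CoT pass may reach a different verdict than the fast pass, so an application would need to decide which one takes precedence.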
This model was fine-tuned on a mixture of the DynaBench dataset and several safety benchmarks (WildGuard, BeaverTails, ToxicChat, Aegis 2.0) using Supervised Fine-Tuning (SFT) followed by Group Relative Policy Optimization (GRPO).