tomg-group-umd/DynaGuard-4B

Task: Text Generation | Concurrency Cost: 1 | Model Size: 4B | Quant: BF16 | Ctx Length: 32k | Published: Jul 22, 2025 | License: apache-2.0 | Architecture: Transformer | Open Weights

DynaGuard-4B is a 4-billion-parameter decoder-only Transformer model developed by the University of Maryland and Capital One, based on Qwen3-4B. It is fine-tuned to act as a dynamic guardian model that evaluates text against user-defined natural language policies for content moderation. It enforces bespoke, application-specific rules and offers two inference options: fast PASS/FAIL classification, or Chain-of-Thought reasoning that explains each violation in detail.

DynaGuard-4B: Dynamic Policy Enforcement for LLMs

DynaGuard-4B is a 4-billion-parameter model from the DynaGuard series, developed by the University of Maryland and Capital One. It is designed to act as a dynamic guardian, evaluating text against user-defined natural language policies. Unlike traditional guardian models that rely on fixed harm categories, DynaGuard-4B supports highly customizable content moderation, for example preventing a chatbot from issuing unauthorized refunds or giving medical advice.

Key Capabilities

  • Dynamic Policies: Enforces arbitrary, natural language policies, enabling bespoke and application-specific moderation rules.
  • Interpretability: Generates detailed, natural-language explanations for policy violations, facilitating human-in-the-loop refinement and chatbot recovery.
  • Dual-Mode Inference: Supports both a Fast Inference mode for direct PASS/FAIL classification with minimal latency and a Chain-of-Thought (CoT) mode that provides a reasoning trace before classification.
  • Strong Performance: Achieves competitive performance on safety and compliance benchmarks, outperforming other dedicated guardian models and generalist models like GPT-4o-mini on the DynaBench test set.
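
The two inference modes listed above can be illustrated with a minimal sketch. The prompt layout, the function names build_guard_prompt and parse_verdict, and the instruction wording are all assumptions made for illustration; this section does not specify the model's actual input template, so consult the official model card before relying on any of it. The sketch only shows how a natural language policy and a transcript might be combined into one prompt, and how a PASS/FAIL label could be read back from either mode.

```python
import re
from typing import Optional


def build_guard_prompt(policy_rules: list[str], transcript: str, use_cot: bool) -> str:
    """Combine policy rules and a transcript into one prompt string.

    NOTE: this layout is hypothetical, not the model's documented template.
    """
    rules = "\n".join(f"{i}. {rule}" for i, rule in enumerate(policy_rules, start=1))
    if use_cot:
        # Chain-of-Thought mode: ask for a reasoning trace before the label.
        instruction = "Reason step by step, then answer PASS or FAIL."
    else:
        # Fast mode: ask for the label only, to keep latency low.
        instruction = "Answer with PASS or FAIL only."
    return f"Policy:\n{rules}\n\nTranscript:\n{transcript}\n\n{instruction}"


def parse_verdict(model_output: str) -> Optional[str]:
    """Return the last PASS/FAIL token in the output, or None if absent.

    The last match is used because in CoT mode the reasoning trace comes
    before the classification.
    """
    matches = re.findall(r"\b(PASS|FAIL)\b", model_output)
    return matches[-1] if matches else None
```

A policy here is just a list of plain-English rules, such as "Do not promise refunds without manager approval." Returning None for unparseable output lets the caller decide whether to retry or to treat the case as a failure.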

When to Use DynaGuard-4B

  • Custom Content Moderation: Ideal for scenarios requiring flexible, application-specific guardrails beyond predefined harm categories.
  • Explainable AI: Suited to cases where understanding why a policy was violated is crucial for debugging or user feedback.
  • Balancing Latency and Detail: Choose between fast, direct classification or detailed, reasoned explanations based on application needs.
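
One way to balance latency against detail is a two-stage cascade: run the fast mode on every input and request a reasoned explanation only when the fast verdict is FAIL. The sketch below is a design illustration rather than a documented workflow. The guard callable, its (policy, transcript, use_cot) signature, and the moderate function are hypothetical stand-ins for whatever inference client an application actually uses.

```python
from typing import Callable, Optional, Tuple

# Hypothetical interface: (policy, transcript, use_cot) -> raw model text.
GuardFn = Callable[[str, str, bool], str]


def moderate(
    policy: str,
    transcript: str,
    guard: GuardFn,
    explain_failures: bool = True,
) -> Tuple[str, Optional[str]]:
    """Return (verdict, explanation); explanation is None unless requested on FAIL."""
    fast_output = guard(policy, transcript, False)
    # Fail closed: anything that is not a clear PASS is treated as FAIL.
    verdict = "PASS" if fast_output.strip().upper().startswith("PASS") else "FAIL"
    if verdict == "PASS" or not explain_failures:
        return verdict, None
    # Second, slower call only for failures, to obtain the reasoning trace.
    explanation = guard(policy, transcript, True)
    return verdict, explanation
```

With this arrangement, compliant traffic costs one fast call, and only flagged inputs incur the extra Chain-of-Thought call. Note that the second call could in principle reach a different conclusion than the first; how to reconcile the two is left to the application.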

This model was fine-tuned on a mixture of the DynaBench dataset and several safety benchmarks (WildGuard, BeaverTails, ToxicChat, Aegis 2.0) using Supervised Fine-Tuning (SFT) followed by Group Relative Policy Optimization (GRPO).