Overview
AgentDoG-Qwen3-4B: Trajectory-Level Agent Safety Guardrail
AgentDoG-Qwen3-4B is a 4 billion parameter model developed by AI45Research, built upon the Qwen3-4B-Instruct-2507 base. It serves as a specialized guardrail framework for autonomous agents, focusing on trajectory-level risk assessment rather than single-step content moderation. This model analyzes the entire execution trace of tool-using agents to detect safety risks that emerge at any point during a multi-step process.
Key Capabilities
- Trajectory-Level Monitoring: Evaluates multi-step agent executions, including observations, reasoning, and actions, to identify unsafe behaviors.
- Taxonomy-Guided Diagnosis: Provides detailed risk labels (risk source, failure mode, real-world harm) and diagnoses the root cause of specific unsafe actions, tracing them to planning steps or tool selections.
- High Performance: Outperforms existing safety guard models like LlamaGuard and Qwen3-Guard on benchmarks such as R-Judge (91.8%), ASSE-Safety (80.4%), and ATBench (92.8%).
- Flexible Use Cases: Can function as a benchmark, a risk classifier for agent trajectories, or an integrated guard module within agent systems.
Good for
- Developers building autonomous agents who need robust, fine-grained safety monitoring beyond simple content filtering.
- Evaluating the safety and security of agent systems by assessing full execution trajectories.
- Identifying and understanding the specific causes of unsafe agent behavior through detailed risk categorization.