Model Overview
ShieldAgent is a specialized 7.6 billion parameter model, initialized from Qwen-2.5-7B-Instruct, developed by thu-coai. Its primary function is to act as a safety judgment model for evaluating the behavioral safety of large language model (LLM) agents. The model is trained on 4000 agent interaction records, which include tool calling requests and their outcomes, along with manual labels and analyses generated by GPT-4o.
Key Capabilities
- Agent Behavioral Safety Assessment: Accurately judges whether an LLM agent's responses and behaviors are safe or unsafe.
- Detailed Explanation Generation: Provides comprehensive analyses corresponding to its safety judgments, offering insights into potential implications.
- Tool Interaction Analysis: Specifically trained to evaluate safety within contexts involving tool usage by agents.
- High Accuracy: Achieves 91.5% accuracy on a test set of 200 interaction records from Gemini-1.5-Flash, demonstrating superior performance compared to GPT-4o (75.5% accuracy) in this specific task.
Use Cases
ShieldAgent is ideal for developers and researchers focused on:
- Enhancing LLM Agent Safety: Integrating a robust mechanism for real-time or post-hoc safety evaluation of agent interactions.
- Automated Safety Audits: Automating the process of identifying and analyzing unsafe behaviors in agent-based systems.
- Developing Safer AI Applications: Ensuring that LLM agents operate within defined safety parameters, especially when interacting with external tools or sensitive information.
For more detailed information and usage examples, refer to the Agent-SafetyBench GitHub Repository.