thu-coai/ShieldAgent

TEXT GENERATIONConcurrency Cost:1Model Size:7.6BQuant:FP8Ctx Length:32kPublished:Feb 20, 2025License:mitArchitecture:Transformer0.0K Open Weights Cold

The thu-coai/ShieldAgent is a 7.6 billion parameter safety judgment model, fine-tuned from Qwen-2.5-7B-Instruct, designed to assess the behavioral safety of LLM agents. It generates detailed explanations for safety judgments based on agent interaction records, including tool calling requests and results. This model excels at identifying unsafe agent behaviors and provides comprehensive analysis, achieving 91.5% accuracy on agent behavioral safety judgment, significantly outperforming GPT-4o on the same task.

Loading preview...

Model Overview

ShieldAgent is a specialized 7.6 billion parameter model, initialized from Qwen-2.5-7B-Instruct, developed by thu-coai. Its primary function is to act as a safety judgment model for evaluating the behavioral safety of large language model (LLM) agents. The model is trained on 4000 agent interaction records, which include tool calling requests and their outcomes, along with manual labels and analyses generated by GPT-4o.

Key Capabilities

  • Agent Behavioral Safety Assessment: Accurately judges whether an LLM agent's responses and behaviors are safe or unsafe.
  • Detailed Explanation Generation: Provides comprehensive analyses corresponding to its safety judgments, offering insights into potential implications.
  • Tool Interaction Analysis: Specifically trained to evaluate safety within contexts involving tool usage by agents.
  • High Accuracy: Achieves 91.5% accuracy on a test set of 200 interaction records from Gemini-1.5-Flash, demonstrating superior performance compared to GPT-4o (75.5% accuracy) in this specific task.

Use Cases

ShieldAgent is ideal for developers and researchers focused on:

  • Enhancing LLM Agent Safety: Integrating a robust mechanism for real-time or post-hoc safety evaluation of agent interactions.
  • Automated Safety Audits: Automating the process of identifying and analyzing unsafe behaviors in agent-based systems.
  • Developing Safer AI Applications: Ensuring that LLM agents operate within defined safety parameters, especially when interacting with external tools or sensitive information.

For more detailed information and usage examples, refer to the Agent-SafetyBench GitHub Repository.