cesun/SODA-Agent-Safety-Judge
cesun/SODA-Agent-Safety-Judge is a safety judge model fine-tuned from Qwen/Qwen3-4B-Instruct-2507 to evaluate whether tool-calling LLM agents comply with or refuse harmful requests. Trained on 170K Claude Opus 4.6 judgments from the SODA benchmark, it performs binary safety classification (SAFE/UNSAFE) with a one-sentence chain-of-thought explanation. Measured as agreement with Claude Opus 4.6, it achieves 98.9% accuracy on the in-domain SODA test set and 97.9% on the zero-shot AgentHarm benchmark, making it a candidate for replacing expensive API calls in agent safety evaluations.
SODA-Agent-Safety-Judge Overview
cesun/SODA-Agent-Safety-Judge is a specialized safety evaluation model fine-tuned from Qwen/Qwen3-4B-Instruct-2507. It acts as a safety judge for AI agent interactions, assessing whether a tool-calling LLM agent safely refuses or dangerously complies with a harmful user request. The model was trained on 170,000 judgments from the SODA benchmark, with Claude Opus 4.6 serving as the teacher model.
Key Capabilities
- Binary Safety Classification: Outputs a clear SAFE or UNSAFE verdict for agent responses.
- Chain-of-Thought Reasoning: Provides a one-sentence explanation for its judgment, enhancing transparency.
- High Agreement with Claude Opus 4.6: Achieves 98.9% accuracy on the in-domain SODA test set and 97.9% accuracy on the zero-shot AgentHarm benchmark, closely matching the teacher model's judgments.
- Specialized for Agent Trajectories: Designed to evaluate multi-turn tool-calling agent conversations, including user requests, agent responses, and tool execution results.
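The verdict and explanation described above can be separated with a small parser. This is a minimal sketch, not the model's documented output contract: it assumes the raw completion contains the label as an uppercase word (SAFE or UNSAFE) somewhere next to the one-sentence explanation. The names `Judgment` and `parse_judgment` are hypothetical helpers introduced for illustration.

```python
import re
from typing import NamedTuple, Optional


class Judgment(NamedTuple):
    verdict: Optional[str]  # "SAFE", "UNSAFE", or None if no label was found
    explanation: str


# Labels are matched as whole uppercase words, so lowercase "safe" or
# "unsafe" inside the explanation is ignored. (Assumed output convention.)
_LABEL = re.compile(r"\b(UNSAFE|SAFE)\b")


def parse_judgment(raw: str) -> Judgment:
    """Split a raw judge completion into a verdict and an explanation."""
    match = _LABEL.search(raw)
    if match is None:
        # No recognizable label: surface this to the caller instead of guessing.
        return Judgment(None, raw.strip())
    # Everything except the first label token is treated as the explanation.
    rest = (raw[:match.start()] + raw[match.end():]).strip()
    return Judgment(match.group(1), rest.lstrip(":-. ").strip())


# Illustrative completions, not real model output.
first = parse_judgment("UNSAFE: The agent executed the requested tool call.")
second = parse_judgment("The agent refused the request. SAFE")
third = parse_judgment("I cannot decide.")
```

Returning `None` for an unparseable completion, rather than defaulting to either label, keeps formatting failures visible in an evaluation run.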
Intended Use Cases
- Replacing Expensive API Calls: Ideal for substituting costly Claude API calls in agent safety evaluation pipelines.
- Agent Safety Benchmarking: Specifically useful for evaluating agent safety within the SODA benchmark and similar contexts.
- Developer Tool: Provides a programmatic way to assess the safety of agent interactions during development and testing.
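Before substituting the judge for API calls in a pipeline, it may be worth checking its agreement with reference verdicts on a small sample of your own trajectories. The sketch below shows one way to compute that agreement and to see which direction any disagreements go. The function names and the sample verdicts are illustrative assumptions, not part of the model or the SODA benchmark.

```python
from collections import Counter
from typing import Sequence


def agreement_rate(judge: Sequence[str], reference: Sequence[str]) -> float:
    """Fraction of items where the judge verdict equals the reference verdict."""
    if len(judge) != len(reference):
        raise ValueError("verdict lists must have the same length")
    if not judge:
        raise ValueError("at least one verdict is required")
    matches = sum(j == r for j, r in zip(judge, reference))
    return matches / len(judge)


def disagreements(judge: Sequence[str], reference: Sequence[str]) -> Counter:
    """Count (judge, reference) pairs that differ, e.g. ('SAFE', 'UNSAFE')."""
    return Counter((j, r) for j, r in zip(judge, reference) if j != r)


# Placeholder verdicts for illustration only; not benchmark data.
judge_verdicts = ["SAFE", "UNSAFE", "SAFE", "UNSAFE"]
reference_verdicts = ["SAFE", "UNSAFE", "UNSAFE", "UNSAFE"]

rate = agreement_rate(judge_verdicts, reference_verdicts)    # 0.75
errors = disagreements(judge_verdicts, reference_verdicts)   # one ('SAFE', 'UNSAFE') pair
```

A ('SAFE', 'UNSAFE') disagreement means the judge passed a trajectory the reference flagged, which is usually the more costly error in a safety evaluation and worth inspecting separately from the overall rate.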
This model is not a general-purpose safety classifier; it is specialized for judging tool-calling agent trajectories.