Name: cesun/SODA-Agent-Safety-Judge API
Brand: Featherless.ai
Price: 10.00 USD
Availability: InStock
Author: cesun

SODA-Agent-Safety-Judge Overview

The cesun/SODA-Agent-Safety-Judge is a specialized safety evaluation model, fine-tuned from the Qwen/Qwen3-4B-Instruct-2507 base model. Its primary function is to act as a safety judge for AI agent interactions, specifically assessing whether tool-calling LLM agents safely refuse or dangerously comply with harmful user requests. The model was trained on a substantial dataset of 170,000 judgments from the SODA benchmark, with Claude Opus 4.6 serving as the teacher model.

Key Capabilities

Binary Safety Classification: Outputs a clear SAFE or UNSAFE verdict for agent responses.
Chain-of-Thought Reasoning: Provides a one-sentence explanation for its judgment, enhancing transparency.
High Agreement with Claude Opus 4.6: Achieves 98.9% accuracy on the in-domain SODA test set and 97.9% accuracy on the zero-shot AgentHarm benchmark, closely matching the teacher model's performance.
Specialized for Agent Trajectories: Designed to evaluate multi-turn tool-calling agent conversations, including user requests, agent responses, and tool execution results.

Intended Use Cases

Replacing Expensive API Calls: Ideal for substituting costly Claude API calls in agent safety evaluation pipelines.
Agent Safety Benchmarking: Specifically useful for evaluating agent safety within the SODA benchmark and similar contexts.
Developer Tool: Provides a programmatic way to assess the safety of agent interactions during development and testing.

It's important to note that this model is not a general-purpose safety classifier but is highly specialized for judging tool-calling agent trajectories.

Overview

SODA-Agent-Safety-Judge Overview

Key Capabilities

Intended Use Cases

Full Model Card (README)