Name: karanxa/saroku-safety-0.5b API
Brand: Featherless.ai
Price: 10.00 USD
Availability: InStock
Author: karanxa

Overview

karanxa/saroku-safety-0.5b is a 494-million-parameter behavioral safety classifier fine-tuned from Qwen/Qwen2.5-0.5B-Instruct. Unlike general-purpose safety models, it is purpose-built for LLM agent pipelines and focuses on detecting behavioral threats specific to agents.

Key Capabilities

Classifies agent behavior into 9 classes: Eight unsafe categories (prompt_injection, trust_hierarchy, goal_drift, corrigibility, minimal_footprint, sycophancy, honesty, and consistency) plus a class for safe actions.
Agent-specific threat detection: Identifies behavioral threats such as an agent resisting shutdown (corrigibility), requesting excessive permissions (minimal_footprint), or abandoning correct behavior under user pressure (sycophancy).
Superior performance: Achieves 98% overall binary accuracy on its benchmark, outperforming models like Granite Guardian 2B (73%), Llama Guard 3 1B (53%), and ShieldGemma 2B (18%).
High recall on behavioral threats: Detects 100% of the behavioral threats that other models are not designed to catch, and leads the next-best competitor by 44 points in Section B of its benchmark.
Efficient inference: Requires approximately 1 GB of VRAM for inference, which keeps it light enough to deploy alongside agent systems.
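
As a rough illustration of how a nine-way label relates to a binary accuracy figure like the one quoted above, the sketch below collapses each predicted label into a safe/unsafe decision and scores the result. The eight unsafe label strings come from the list above; the name of the safe class ("safe"), the helper names, and the sample data are assumptions made for illustration, not the model's documented output format or its benchmark.

```python
# Unsafe category names as listed in the model description.
UNSAFE_LABELS = frozenset({
    "prompt_injection", "trust_hierarchy", "goal_drift", "corrigibility",
    "minimal_footprint", "sycophancy", "honesty", "consistency",
})
SAFE_LABEL = "safe"  # assumption: the exact string for the safe class is not documented here


def is_unsafe(label: str) -> bool:
    """Collapse a nine-way label into a binary safe/unsafe decision."""
    normalized = label.strip().lower()
    if normalized == SAFE_LABEL:
        return False
    if normalized in UNSAFE_LABELS:
        return True
    raise ValueError(f"unknown label: {label!r}")


def binary_accuracy(predicted: list[str], expected: list[str]) -> float:
    """Fraction of examples where the safe/unsafe decision matches the reference."""
    if len(predicted) != len(expected):
        raise ValueError("predicted and expected must have the same length")
    if not predicted:
        raise ValueError("cannot score an empty evaluation set")
    hits = sum(is_unsafe(p) == is_unsafe(e) for p, e in zip(predicted, expected))
    return hits / len(predicted)


# Made-up sample data, for illustration only.
predicted = ["safe", "sycophancy", "goal_drift", "safe"]
expected = ["safe", "honesty", "goal_drift", "corrigibility"]
print(binary_accuracy(predicted, expected))
```

In this sample, the second prediction names the wrong unsafe category but still counts as a binary hit, which is why a binary score can sit above per-class accuracy.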

Good For

Developers building LLM agents who need to ensure behavioral safety and prevent failure modes like goal drift, sycophancy, and corrigibility failures such as resisting shutdown.
Integrating a specialized safety layer into agent pipelines to catch threats that traditional content moderation models overlook.
Applications where agents might interact with users or systems and require robust checks against unintended or harmful actions.
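
One way to picture such a safety layer is as a gate between the action an agent proposes and the code that executes it. The sketch below is a minimal version under stated assumptions: `classify` is a hypothetical callable standing in for however the model is actually invoked (hosted API or local inference), the "safe" label string is assumed, and `fake_classifier` is a keyword stub used only to make the example runnable.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class GateDecision:
    allowed: bool
    label: str


def gate_action(
    action: str,
    classify: Callable[[str], str],  # hypothetical: returns one label per proposed action
    safe_label: str = "safe",        # assumption: name of the safe class
) -> GateDecision:
    """Allow an action only when the classifier explicitly labels it safe."""
    label = classify(action).strip().lower()
    # Fail closed: any label other than the safe one blocks the action.
    return GateDecision(allowed=(label == safe_label), label=label)


def fake_classifier(action: str) -> str:
    """Keyword stub for demonstration only; not how the real model decides."""
    return "corrigibility" if "disable shutdown" in action.lower() else "safe"


if __name__ == "__main__":
    for proposed in ("Read the quarterly report", "Disable shutdown hook before continuing"):
        decision = gate_action(proposed, fake_classifier)
        status = "ALLOW" if decision.allowed else "BLOCK"
        print(f"{status} [{decision.label}] {proposed}")
```

Blocking on every non-safe label, including unrecognized ones, is a deliberately conservative choice; a real pipeline might instead route some categories to human review.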