Overview
karanxa/saroku-safety-0.5b is a 494-million-parameter behavioral safety classifier, fine-tuned from Qwen/Qwen2.5-0.5B-Instruct. Unlike general-purpose safety models, it is purpose-built for LLM agent pipelines and focuses on detecting behavioral threats specific to agents.
Key Capabilities
- 9-class behavior classification: Distinguishes safe actions from eight categories of unsafe agent behavior: prompt_injection, trust_hierarchy, goal_drift, corrigibility, minimal_footprint, sycophancy, honesty, and consistency.
- Agent-specific threat detection: Identifies behavioral threats such as an agent resisting shutdown (corrigibility), requesting excessive permissions (minimal_footprint), or abandoning correct behavior under user pressure (sycophancy).
- Superior performance: Achieves 98% overall binary accuracy on its benchmark, compared with 73% for Granite Guardian 2B, 53% for Llama Guard 3 1B, and 18% for ShieldGemma 2B.
- High recall on behavioral threats: Detects 100% of the behavioral threats that other models are not designed to catch, leading the next-best competitor by 44 percentage points in Section B of its benchmark.
- Efficient inference: Requires approximately 1 GB of VRAM for inference, making it practical to deploy inside agent systems.
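The roughly 1 GB figure is consistent with a back-of-the-envelope estimate of weight memory alone. The sketch below assumes 16-bit weights (2 bytes per parameter), which the description above does not state; the other precisions are shown only for comparison. Activations, the KV cache, and framework overhead come on top of these numbers.

```python
# Rough weight-memory estimate. The precisions listed here are assumptions
# for illustration; the source only states "approximately 1GB VRAM".
PARAMS = 494_000_000  # parameter count stated in the overview

BYTES_PER_PARAM = {"float32": 4, "float16/bfloat16": 2, "int8": 1}


def weight_memory_gb(params: int, bytes_per_param: int) -> float:
    """Memory for the weights alone, in decimal gigabytes (1 GB = 1e9 bytes)."""
    return params * bytes_per_param / 1e9


for dtype, size in BYTES_PER_PARAM.items():
    print(f"{dtype}: {weight_memory_gb(PARAMS, size):.2f} GB")
```

At 2 bytes per parameter the weights come to about 0.99 GB, which matches the stated requirement if the model is served in half precision.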
Good For
- Developers building LLM agents who need to ensure behavioral safety and prevent failure modes such as goal drift, sycophancy, and loss of corrigibility.
- Integrating a specialized safety layer into agent pipelines to catch threats that traditional content moderation models overlook.
- Applications where agents might interact with users or systems and require robust checks against unintended or harmful actions.
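As a minimal sketch of what such a safety layer could look like, the example below gates each proposed agent action on a classifier label. The label names come from the capability list above, but the exact output strings, the `classify` callable, and the `GateDecision` type are hypothetical; the model's real prompt format and loading code are not covered here, so a stand-in classifier is used.

```python
from dataclasses import dataclass
from typing import Callable

# Label names taken from the capability list; the exact strings the model
# emits are an assumption.
UNSAFE_LABELS = frozenset({
    "prompt_injection", "trust_hierarchy", "goal_drift", "corrigibility",
    "minimal_footprint", "sycophancy", "honesty", "consistency",
})
SAFE_LABEL = "safe"


@dataclass
class GateDecision:
    allowed: bool  # True only when the classifier explicitly says "safe"
    label: str     # normalized label returned by the classifier
    known: bool    # False if the label is outside the expected nine


def gate_action(action: str, classify: Callable[[str], str]) -> GateDecision:
    """Decide whether a proposed agent action may run.

    `classify` is a hypothetical wrapper around the safety model that
    returns a single label string for the given action text.
    """
    label = classify(action).strip().lower()
    known = label == SAFE_LABEL or label in UNSAFE_LABELS
    # Fail closed: anything other than an explicit "safe" is blocked,
    # including labels this gate does not recognize.
    return GateDecision(allowed=(label == SAFE_LABEL), label=label, known=known)


# Stand-in classifier for demonstration only; a real deployment would
# call the model here instead.
def fake_classify(action: str) -> str:
    return "corrigibility" if "disable shutdown" in action.lower() else "safe"


print(gate_action("Read the quarterly report", fake_classify))
print(gate_action("Disable shutdown hook before continuing", fake_classify))
```

Blocking on anything other than an explicit safe label is a design choice made for this sketch: it trades some false positives for the guarantee that a malformed or unexpected classifier output never lets an action through.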