SycoFact 4B: Lightweight Sycophancy and Safety Evaluator
SycoFact is a 4.3-billion-parameter model, fine-tuned from Gemma 3 4B IT, that serves as an alignment evaluator: it detects sycophancy and dangerous outputs in the responses of other AI models. Its training relies entirely on geometric activation directions from a 27B mentor model, so no human labels were used.
Key Capabilities & Performance
- 100% detection rate on Psychosis-Bench for delusion confirmation across all 16 multi-turn escalation scenarios.
- Correlates at r = -0.810 with expert harm ratings on the AISI Harmful Advice dataset. The negative sign means SycoFact's scores fall as expert-rated harm rises, which is the expected direction if higher scores indicate safer responses.
- F1 = 0.872 on PKU-SafeRLHF for safety classification, competitive with much larger models such as GPT-4 on safety benchmarks (e.g., 91-94% on RewardBench safety subsets).
- Evaluates responses across seven dimensions: Factual, Honest, Harmless, Helpful, Honoring, Sycophantic (lower is better), and a Composite overall safety score.
- Offers two modes: Fast (scores only, recommended for deployment) and Reasoning (scores plus per-dimension explanations and feedback).
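The seven scores listed above can be carried through a pipeline as one small typed record. The following is a minimal sketch, not SycoFact's actual output format: it assumes the evaluator's Fast-mode output has already been converted to a JSON object with lowercase dimension names as keys and scores in the range 0.0 to 1.0. The names EvalScores and parse_scores, the key names, and the score range are all hypothetical.

```python
import json
from dataclasses import dataclass

# Dimension names follow the list above. The lowercase keys and the
# 0.0-1.0 range are assumptions for illustration, not a documented format.
DIMENSIONS = (
    "factual", "honest", "harmless", "helpful",
    "honoring", "sycophantic", "composite",
)


@dataclass(frozen=True)
class EvalScores:
    """Hypothetical container for one evaluated response."""
    factual: float
    honest: float
    harmless: float
    helpful: float
    honoring: float
    sycophantic: float  # lower is better
    composite: float    # overall safety score


def parse_scores(raw: str) -> EvalScores:
    """Parse a JSON object of scores and validate that it is complete."""
    data = json.loads(raw)
    missing = [name for name in DIMENSIONS if name not in data]
    if missing:
        raise ValueError(f"missing dimensions: {missing}")
    values = {name: float(data[name]) for name in DIMENSIONS}
    for name, value in values.items():
        # Assumed range check; adjust if the real scale differs.
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} out of assumed range: {value}")
    return EvalScores(**values)
```

Validating eagerly means a malformed or truncated evaluator output fails loudly at the parsing step instead of silently passing a downstream threshold check.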
When to Use SycoFact
- Safety Guardrailing: Ideal for integrating into AI pipelines to automatically flag or filter out sycophantic or harmful AI responses.
- Evaluating AI Alignment: Useful for developers and researchers to assess how well their models adhere to safety and helpfulness principles.
- Detecting Delusion Confirmation: Particularly strong in scenarios where AI might inadvertently confirm user delusions, as demonstrated by its perfect score on Psychosis-Bench.
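The guardrailing use case above can be sketched as a small filter that sits between a generator model and the user. This is an illustration under stated assumptions, not a documented integration: the evaluator is passed in as a plain callable that returns a dictionary with "sycophantic" and "composite" scores on an assumed 0.0 to 1.0 scale, and both thresholds are placeholder values that would need tuning on real data. The names should_flag and filter_responses are hypothetical.

```python
from typing import Callable, Dict, List, Tuple

Scores = Dict[str, float]
# Hypothetical evaluator signature: (prompt, response) -> scores.
Evaluator = Callable[[str, str], Scores]
Pair = Tuple[str, str]


def should_flag(
    scores: Scores,
    max_sycophantic: float = 0.5,  # placeholder threshold (assumption)
    min_composite: float = 0.5,    # placeholder threshold (assumption)
) -> bool:
    """Flag a response that is too sycophantic or scores too low overall."""
    return (
        scores["sycophantic"] > max_sycophantic
        or scores["composite"] < min_composite
    )


def filter_responses(
    pairs: List[Pair],
    evaluate: Evaluator,
) -> Tuple[List[Pair], List[Pair]]:
    """Split (prompt, response) pairs into passed and flagged lists."""
    passed: List[Pair] = []
    flagged: List[Pair] = []
    for prompt, response in pairs:
        scores = evaluate(prompt, response)
        if should_flag(scores):
            flagged.append((prompt, response))
        else:
            passed.append((prompt, response))
    return passed, flagged
```

Keeping the evaluator behind a callable lets the same filter run against a stub in tests and against a real model wrapper in deployment.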
Limitations
- Not a preference ranker: Designed for safety classification, not general quality evaluation.
- Limited factual knowledge: As a 4B parameter model, its world knowledge is constrained, though it can detect confident falsehoods on well-known topics.
- English only: Trained and evaluated exclusively on English text.