SycoFact 4B: Lightweight Sycophancy and Safety Evaluator

SycoFact is a 4.3-billion-parameter model, fine-tuned from Gemma 3 4B IT and built as an alignment evaluator: it detects sycophancy and dangerous outputs from other AI models. Its key differentiator is its training methodology, which relies entirely on geometric activation directions from a 27B mentor model, so no human labels were used.

Key Capabilities & Performance

Reaches a 100% detection rate on Psychosis-Bench for delusion confirmation across all 16 multi-turn escalation scenarios.
Shows a strong negative correlation (r = -0.810) with expert harm ratings on the AISI Harmful Advice dataset: higher SycoFact scores go with lower rated harm, indicating close agreement with human judgment of harm.
Scores F1 = 0.872 on PKU-SafeRLHF for safety classification, competitive with much larger models such as GPT-4 on safety subsets (e.g., 91-94% on RewardBench safety subsets).
Evaluates responses across seven dimensions: Factual, Honest, Harmless, Helpful, Honoring, Sycophantic (lower is better), and a Composite overall safety score.
Offers two modes: Fast (scores only, recommended for deployment) and Reasoning (scores plus per-dimension explanations and feedback).
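
As a rough illustration of the seven dimensions listed above, the sketch below holds one evaluation result in a small Python structure. The class name, the field names, and the 0.0 to 1.0 score range are assumptions made for this example; the summary does not specify the model's actual output format.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class SycoFactScores:
    """Hypothetical container for one evaluation (not the model's real output schema)."""

    factual: float
    honest: float
    harmless: float
    helpful: float
    honoring: float
    sycophantic: float  # lower is better, unlike the other dimensions
    composite: float    # overall safety score

    def __post_init__(self):
        # Assumption: every score lies in [0.0, 1.0].
        for f in fields(self):
            value = getattr(self, f.name)
            if not 0.0 <= value <= 1.0:
                raise ValueError(f"{f.name} outside assumed range: {value}")


def weakest_dimension(scores: SycoFactScores) -> str:
    """Name the worst-looking dimension, excluding the composite.

    The sycophantic score is inverted first so that higher means better
    for every dimension being compared.
    """
    oriented = {
        f.name: getattr(scores, f.name)
        for f in fields(scores)
        if f.name != "composite"
    }
    oriented["sycophantic"] = 1.0 - oriented["sycophantic"]
    return min(oriented, key=oriented.get)


example = SycoFactScores(
    factual=0.9, honest=0.8, harmless=0.95, helpful=0.7,
    honoring=0.85, sycophantic=0.6, composite=0.8,
)
print(weakest_dimension(example))
```

Inverting the sycophantic score before comparing keeps the "lower is better" convention from being mistaken for a strong result.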

When to Use SycoFact

Safety Guardrailing: Ideal for integrating into AI pipelines to automatically flag or filter out sycophantic or harmful AI responses.
Evaluating AI Alignment: Useful for developers and researchers to assess how well their models adhere to safety and helpfulness principles.
Detecting Delusion Confirmation: Particularly strong in scenarios where AI might inadvertently confirm user delusions, as demonstrated by its perfect score on Psychosis-Bench.
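
The guardrailing use case can be sketched as a simple threshold filter over scored responses. This is a minimal sketch under stated assumptions: scores are taken to be plain dictionaries with values between 0.0 and 1.0, and the function names and threshold defaults are hypothetical, not values recommended by the model's authors.

```python
def should_flag(scores, max_sycophantic=0.5, min_composite=0.5):
    """Return True if a response should be held back.

    `scores` is assumed to map dimension names to floats in [0.0, 1.0].
    Both thresholds are illustrative and would need tuning on real data.
    """
    return (
        scores["sycophantic"] > max_sycophantic
        or scores["composite"] < min_composite
    )


def filter_responses(candidates, **thresholds):
    """Split (text, scores) pairs into passed and flagged response texts."""
    passed, flagged = [], []
    for text, scores in candidates:
        if should_flag(scores, **thresholds):
            flagged.append(text)
        else:
            passed.append(text)
    return passed, flagged


candidates = [
    ("Balanced answer", {"sycophantic": 0.1, "composite": 0.9}),
    ("Flattering answer", {"sycophantic": 0.8, "composite": 0.7}),
    ("Unsafe answer", {"sycophantic": 0.2, "composite": 0.3}),
]
passed, flagged = filter_responses(candidates)
print(passed, flagged)
```

Only the scores are needed here, which matches the Fast mode described earlier as the recommended choice for deployment.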

Limitations

Not a preference ranker: Designed for safety classification, not general quality evaluation.
Limited factual knowledge: As a 4B parameter model, its world knowledge is constrained, though it can detect confident falsehoods on well-known topics.
English only: Trained and evaluated exclusively on English text.

Full Model Card (README)