iwalton3/sycofact

Vision | Concurrency Cost: 1 | Model Size: 4.3B | Quant: BF16 | Ctx Length: 32k | Published: Mar 29, 2026 | License: gemma | Architecture: Transformer

iwalton3/sycofact is a 4.3 billion parameter alignment evaluator, fine-tuned from Gemma 3 4B IT, designed to detect sycophancy and dangerous AI outputs. This model excels at identifying delusion confirmation and harmful advice, achieving 100% detection on Psychosis-Bench and strong correlation with expert harm ratings on the AISI Harmful Advice dataset. It provides a lightweight yet highly effective solution for safety classification and harm reduction in AI responses, with all training signal derived from geometric activation directions without human labels.

SycoFact 4B: Lightweight Sycophancy and Safety Evaluator

SycoFact is a 4.3 billion parameter model, fine-tuned from Gemma 3 4B IT, specifically engineered as an alignment evaluator. Its primary function is to detect sycophancy and dangerous outputs from other AI models. A key differentiator is its training methodology, which relies entirely on geometric activation directions from a 27B mentor model, meaning no human labels were used in its training process.

Key Capabilities & Performance

  • 100% detection rate on Psychosis-Bench for delusion confirmation across all 16 multi-turn escalation scenarios.
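  • Achieves a correlation of r = -0.810 with expert harm ratings on the AISI Harmful Advice dataset, indicating strong agreement with human judgment of harm. The sign is negative presumably because higher SycoFact scores indicate safer responses while higher expert ratings indicate more harm.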
  • F1 = 0.872 on PKU-SafeRLHF for safety classification; competitive with much larger models such as GPT-4 on RewardBench safety subsets (91-94%).
  • Evaluates responses across seven dimensions: Factual, Honest, Harmless, Helpful, Honoring, Sycophantic (lower is better), and a Composite overall safety score.
  • Offers two modes: Fast (scores only, recommended for deployment) and Reasoning (scores plus per-dimension explanations and feedback).
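The seven dimensions listed above map naturally onto a small record type. The sketch below is only an illustration of how Fast-mode output might be held and validated downstream. It assumes the scores arrive as a JSON object keyed by lowercase dimension names; that output format, the `SycoFactScores` class, and the `parse_scores` helper are all hypothetical and are not taken from the model card.

```python
import json
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class SycoFactScores:
    """Hypothetical container for the seven dimensions named in the model card."""
    factual: float
    honest: float
    harmless: float
    helpful: float
    honoring: float
    sycophantic: float  # lower is better, per the model card
    composite: float    # overall safety score


def parse_scores(raw: str) -> SycoFactScores:
    """Parse a JSON object of scores (assumed format) into a SycoFactScores record."""
    data = json.loads(raw)
    names = [f.name for f in fields(SycoFactScores)]
    missing = [n for n in names if n not in data]
    if missing:
        raise ValueError(f"missing dimensions: {missing}")
    # Extra keys are ignored; every expected key is coerced to float.
    return SycoFactScores(**{n: float(data[n]) for n in names})
```

Rejecting incomplete output up front keeps a partially generated response from being treated as a valid evaluation. If the real output format differs, only `parse_scores` would need to change.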

When to Use SycoFact

  • Safety Guardrailing: Ideal for integrating into AI pipelines to automatically flag or filter out sycophantic or harmful AI responses.
  • Evaluating AI Alignment: Useful for developers and researchers to assess how well their models adhere to safety and helpfulness principles.
  • Detecting Delusion Confirmation: Particularly strong in scenarios where AI might inadvertently confirm user delusions, as demonstrated by its perfect score on Psychosis-Bench.
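The guardrailing use case above can be sketched as a simple threshold filter. This is a minimal illustration, not the author's pipeline: it assumes scores lie on a 0 to 1 scale where a higher composite is safer, and the threshold values, the `should_flag` and `partition_responses` helpers, and the `score_fn` callable (a stand-in for whatever code actually runs SycoFact) are all hypothetical.

```python
from typing import Callable, Iterable, List, Mapping, Tuple

Scores = Mapping[str, float]


def should_flag(
    scores: Scores,
    min_composite: float = 0.5,    # assumed threshold, tune on your own data
    max_sycophantic: float = 0.5,  # assumed threshold, tune on your own data
) -> bool:
    """Flag a response whose composite is too low or whose sycophancy is too high."""
    return (
        scores["composite"] < min_composite
        or scores["sycophantic"] > max_sycophantic
    )


def partition_responses(
    responses: Iterable[str],
    score_fn: Callable[[str], Scores],  # hypothetical wrapper around the evaluator
) -> Tuple[List[str], List[str]]:
    """Split responses into (kept, flagged) using should_flag with default thresholds."""
    kept: List[str] = []
    flagged: List[str] = []
    for response in responses:
        target = flagged if should_flag(score_fn(response)) else kept
        target.append(response)
    return kept, flagged
```

Passing the evaluator in as `score_fn` keeps the filtering logic testable with a stub and independent of how the model is served.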

Limitations

  • Not a preference ranker: Designed for safety classification, not general quality evaluation.
  • Limited factual knowledge: As a 4B parameter model, its world knowledge is constrained, though it can detect confident falsehoods on well-known topics.
  • English only: Trained and evaluated exclusively on English text.