YunoAIdotcom/Qwen3-14B-RefusalDirection-ThinkingAware

Hugging Face
Text Generation · Model size: 14B · Quant: FP8 · Context length: 32k · Published: Jul 28, 2025 · License: apache-2.0 · Architecture: Transformer · Open weights

Qwen3-14B-RefusalDirection-ThinkingAware is a 14-billion-parameter research model, forked from Qwen/Qwen3-14B, designed to investigate AI safety mechanisms and their cognitive costs. Its safety mechanisms are significantly reduced, so it readily produces harmful content, and it demonstrates that standard keyword-based safety evaluations underestimate bypasses by nearly 50%. Ablating its refusal mechanism also yields a +0.6% MMLU gain over the baseline, suggesting that safety alignment carries a measurable cognitive cost. The model is intended exclusively for AI safety research.


Qwen3-14B-RefusalDirection-ThinkingAware: AI Safety Research Model

This model is a research artifact derived from Qwen/Qwen3-14B, specifically engineered to explore the vulnerabilities and mechanisms of AI safety. It features significantly reduced safety protocols, making it prone to generating harmful content, and is strictly for research purposes, not production use.

Key Findings & Capabilities:

  • Evaluation Gap: Demonstrates that standard keyword-based safety metrics underestimate actual safety bypasses by nearly 50%, highlighting a systemic flaw in current AI safety evaluation methodologies.
  • Cognitive Cost of Alignment: By surgically removing refusal mechanisms, the model shows a +0.6% gain in MMLU academic benchmark performance compared to its baseline. This suggests that safety alignment imposes a measurable "cognitive cost" on a model's general reasoning abilities.
  • Context-Dependency of Safety: Utilizes a novel "Thinking-Aware" modification to refusal ablation, revealing that different harmful domains (e.g., cybercrime vs. harassment) are governed by distinct refusal mechanisms, activated under different contexts.
  • High Bypass Rates: Achieves 90% bypass for cybercrime and 70% for misinformation, while showing lower bypass for harassment (26%), indicating the targeted nature of its refusal ablation.
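Refusal ablation of the kind described above is commonly implemented as a directional intervention: estimate a "refusal direction" from the difference between mean activations on harmful versus harmless prompts, then project that direction out of the model's hidden states. The sketch below illustrates this with NumPy and hypothetical helper names; the actual procedure used here, including the "Thinking-Aware" modification, is not specified in this card.

```python
import numpy as np

def refusal_direction(harmful_acts, harmless_acts):
    """Unit-normalized difference-of-means direction.

    harmful_acts / harmless_acts: (n_prompts, d_model) arrays of
    residual-stream activations collected at some layer (assumed inputs).
    """
    d = harmful_acts.mean(axis=0) - harmless_acts.mean(axis=0)
    return d / np.linalg.norm(d)

def ablate(hidden, direction):
    """Remove the component of each hidden state along `direction`:
    h' = h - (h · d) d, applied row-wise to a (n_tokens, d_model) array."""
    return hidden - np.outer(hidden @ direction, direction)
```

After ablation, every hidden state is orthogonal to the estimated refusal direction, which is what makes the intervention "surgical": all other components of the representation are left untouched.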

Intended Use:

  • Studying the nature and cost of refusal mechanisms in reasoning models.
  • Benchmarking for the development of more robust safety alignment techniques.
  • Providing empirical evidence for critical flaws in current AI safety evaluation standards.
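To make the claimed evaluation gap concrete: keyword-based safety metrics typically classify a response as a refusal if it contains stock refusal phrases. The minimal sketch below (marker list and function name are illustrative, not the evaluation actually used for this model) shows how such a check can count a response as "safe" even when harmful content follows the refusal phrase, which is one way a large underestimate of bypasses can arise.

```python
# Illustrative marker list; real keyword evaluators use longer phrase sets.
REFUSAL_MARKERS = ("i can't", "i cannot", "as an ai", "i'm sorry")

def keyword_refused(response: str) -> bool:
    """Return True if the response contains any stock refusal phrase."""
    text = response.lower()
    return any(marker in text for marker in REFUSAL_MARKERS)

# A response can open with a token refusal and still deliver the payload;
# keyword matching scores it as a refusal, so the bypass goes uncounted.
print(keyword_refused("I'm sorry, but here is the full procedure: ..."))  # → True
```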