coder3101/Qwen3-VL-4B-Instruct-heretic

VISIONConcurrency Cost:1Model Size:4BQuant:BF16Ctx Length:32kPublished:Nov 23, 2025License:apache-2.0Architecture:Transformer0.0K Open Weights Cold

The coder3101/Qwen3-VL-4B-Instruct-heretic is a 4 billion parameter vision-language model, derived from Qwen's Qwen3-VL-4B-Instruct, with its refusal behavior significantly reduced through the Heretic v1.0.1 abliteration process. This model retains the original's advanced multimodal reasoning, visual agent capabilities, and long context understanding, but is specifically modified to exhibit fewer refusals, making it suitable for applications requiring less constrained responses. It excels in tasks involving visual coding, spatial perception, and enhanced OCR across 32 languages.

Loading preview...

Overview

This model, coder3101/Qwen3-VL-4B-Instruct-heretic, is a 4 billion parameter vision-language model based on Qwen's Qwen3-VL-4B-Instruct. Its primary distinction is the application of the Heretic v1.0.1 abliteration process, which significantly reduces refusal rates compared to the original model. While the original Qwen3-VL-4B-Instruct exhibited 91 refusals out of 100, this 'heretic' version shows only 3 refusals out of 100, as measured by KL divergence of 0.47.

Key Capabilities

This model inherits the comprehensive upgrades of the Qwen3-VL series, offering advanced multimodal functionalities:

  • Visual Agent: Capable of operating PC/mobile GUIs, recognizing elements, understanding functions, and completing tasks.
  • Visual Coding Boost: Generates Draw.io/HTML/CSS/JS from images or videos.
  • Advanced Spatial Perception: Judges object positions, viewpoints, and occlusions, enabling 2D and 3D grounding for spatial reasoning.
  • Long Context & Video Understanding: Features a native 256K context, expandable to 1M, for handling extensive text and hours-long video with full recall.
  • Enhanced Multimodal Reasoning: Excels in STEM/Math tasks, providing causal analysis and logical, evidence-based answers.
  • Upgraded Visual Recognition: Broad and high-quality pretraining allows recognition of a wide array of entities, including celebrities, products, and landmarks.
  • Expanded OCR: Supports 32 languages, with robust performance in challenging conditions and improved long-document structure parsing.
  • Text Understanding: Offers seamless text-vision fusion for unified comprehension on par with pure LLMs.

When to Use This Model

This model is particularly suited for use cases where the original Qwen3-VL-4B-Instruct's refusal behavior is undesirable. Developers needing a powerful vision-language model with strong multimodal capabilities, but requiring less constrained or more direct responses, will find this 'heretic' version beneficial. It is ideal for applications demanding robust visual understanding, complex reasoning, and reduced content filtering.