n0kovo/Qwen3-VL-32B-Instruct-heretic-v2

VISIONConcurrency Cost:2Model Size:33.4BQuant:FP8Ctx Length:32kPublished:Dec 16, 2025License:apache-2.0Architecture:Transformer0.0K Open Weights Cold

n0kovo/Qwen3-VL-32B-Instruct-heretic-v2 is a 33.4 billion parameter vision-language model, a decensored version of Qwen/Qwen3-VL-32B-Instruct created using Heretic v1.1.0. Developed by Qwen, this model features comprehensive upgrades in text understanding, visual perception, and reasoning, with an extended context length of 32768 tokens. It is specifically designed for enhanced multimodal reasoning, visual agent capabilities, and advanced spatial perception, making it suitable for complex visual and textual tasks.

Loading preview...

Model Overview

This model, n0kovo/Qwen3-VL-32B-Instruct-heretic-v2, is a 33.4 billion parameter vision-language model based on the Qwen3-VL-32B-Instruct architecture. It has been decensored using Heretic v1.1.0, significantly reducing refusals from 99/100 in the original model to 7/100, while maintaining a KL divergence of 0.1565 compared to the original. The model features a 32768 token context length and represents a comprehensive upgrade in the Qwen series, focusing on superior text understanding, deeper visual perception, and enhanced reasoning capabilities.

Key Capabilities

  • Visual Agent: Capable of operating PC/mobile GUIs, recognizing elements, understanding functions, and completing tasks.
  • Visual Coding Boost: Generates Draw.io, HTML, CSS, and JavaScript from image and video inputs.
  • Advanced Spatial Perception: Judges object positions, viewpoints, and occlusions, enabling 2D and 3D grounding for spatial reasoning.
  • Long Context & Video Understanding: Supports a native 256K context, expandable to 1M, for handling extensive documents and hours-long video with precise recall.
  • Enhanced Multimodal Reasoning: Excels in STEM/Math tasks, providing causal analysis and logical, evidence-based answers.
  • Upgraded Visual Recognition: Broad and high-quality pretraining allows recognition of a wide array of entities, including celebrities, products, and landmarks.
  • Expanded OCR: Supports 32 languages with robust performance in challenging conditions and improved long-document structure parsing.
  • Seamless Text-Vision Fusion: Achieves lossless, unified comprehension by integrating text and vision understanding on par with pure LLMs.

Architectural Innovations

  • Interleaved-MRoPE: Utilizes full-frequency allocation over time, width, and height via robust positional embeddings for enhanced long-horizon video reasoning.
  • DeepStack: Fuses multi-level ViT features to capture fine-grained details and sharpen image-text alignment.
  • Text-Timestamp Alignment: Employs precise, timestamp-grounded event localization for stronger video temporal modeling.

Good for

  • Applications requiring advanced visual understanding and reasoning.
  • Tasks involving GUI automation and visual coding generation.
  • Use cases demanding robust multimodal interaction with reduced content restrictions.