coder3101/Qwen3-VL-32B-Instruct-heretic-v2
coder3101/Qwen3-VL-32B-Instruct-heretic-v2 is a 33.4-billion-parameter vision-language model derived from Qwen/Qwen3-VL-32B-Instruct and processed with Heretic v1.1.0 for decensoring. It inherits the base model's enhanced visual perception, reasoning, and extended context length, making it suitable for complex multimodal tasks. It excels in visual agent capabilities, advanced spatial perception, and long-context video understanding, offering a versatile option for applications that require robust visual and textual comprehension.
Overview
This model, coder3101/Qwen3-VL-32B-Instruct-heretic-v2, is a 33.4-billion-parameter vision-language model based on the Qwen3-VL-32B-Instruct architecture. It has been processed with the Heretic v1.1.0 tool to produce a decensored variant. The original Qwen3-VL series is noted for comprehensive upgrades in text understanding, visual perception, reasoning, and extended context length, including enhanced spatial and video-dynamics comprehension.
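A minimal loading sketch for this checkpoint, assuming a recent `transformers` release with Qwen3-VL support. The `AutoModelForImageTextToText` class and the `dtype`/`device_map` arguments are assumptions based on the standard Hugging Face pattern, not verified against this specific repository:

```python
MODEL_ID = "coder3101/Qwen3-VL-32B-Instruct-heretic-v2"

def approx_weight_gib(num_params: float, bytes_per_param: int = 2) -> float:
    """Rough weight-only memory footprint in GiB for a given dtype width."""
    return num_params * bytes_per_param / 2**30

# 33.4B parameters at bf16 (2 bytes/param) is roughly 62 GiB of weights
# alone, so multi-GPU sharding via device_map="auto" is usually required.

def load_model():
    # Deferred import: needs a transformers release with Qwen3-VL support.
    # Class name is an assumption; the upstream Qwen3-VL card may use a
    # model-specific class instead.
    from transformers import AutoModelForImageTextToText, AutoProcessor

    processor = AutoProcessor.from_pretrained(MODEL_ID)
    model = AutoModelForImageTextToText.from_pretrained(
        MODEL_ID,
        dtype="auto",        # `torch_dtype` on older transformers releases
        device_map="auto",   # shard across available accelerators
    )
    return processor, model
```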
Key Differentiators
This variant differs from the original Qwen3-VL-32B-Instruct primarily in its refusal behavior: it refused 9 of 100 test prompts, versus 99 of 100 for the base model. The decensoring was achieved by applying Heretic's abliteration parameters to the attention and MLP layers.
Core Capabilities (from original Qwen3-VL-32B-Instruct)
- Visual Agent: Capable of operating PC/mobile GUIs by recognizing elements, understanding functions, and completing tasks.
- Visual Coding Boost: Generates Draw.io/HTML/CSS/JS from images/videos.
- Advanced Spatial Perception: Judges object positions, viewpoints, and occlusions, supporting 2D and 3D grounding for spatial reasoning.
- Long Context & Video Understanding: Features a native 256K context, expandable to 1M, handling extensive text and hours-long video with full recall.
- Enhanced Multimodal Reasoning: Excels in STEM/Math tasks, providing causal analysis and logical, evidence-based answers.
- Upgraded Visual Recognition: Broad and high-quality pretraining enables recognition of a wide array of entities including celebrities, products, and landmarks.
- Expanded OCR: Supports 32 languages and is robust in challenging conditions like low light or blur, with improved long-document structure parsing.
- Text Understanding: Offers seamless text-vision fusion for unified comprehension on par with pure LLMs.
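A minimal inference sketch for the image Q&A use cases above, following the standard `transformers` processor/chat-template pattern used by Qwen-VL models. The processor and model are assumed to come from a setup like the loading pattern on the upstream Qwen3-VL model card; this is an illustrative sketch, not a verified end-to-end script:

```python
def answer_image_question(model, processor, image, question, max_new_tokens=256):
    """Single-image Q&A using the Qwen-VL chat format (sketch)."""
    # Qwen-VL chat format: interleaved image and text content parts.
    messages = [{
        "role": "user",
        "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": question},
        ],
    }]
    # Render the chat template to text, then tokenize text and pixels together.
    prompt = processor.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    inputs = processor(text=[prompt], images=[image], return_tensors="pt")
    # (Move `inputs` to model.device when running on an accelerator.)
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    # For brevity this decodes the full sequence; production code usually
    # strips the prompt tokens before decoding.
    return processor.batch_decode(output_ids, skip_special_tokens=True)[0]
```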
Model Architecture Updates
Key architectural enhancements in the Qwen3-VL series include Interleaved-MRoPE for robust positional embeddings across time, width, and height, DeepStack for fusing multi-level ViT features, and Text-Timestamp Alignment for precise event localization in video.
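The interleaving idea can be illustrated with a toy sketch. This is not the actual Qwen3-VL implementation; the simple round-robin channel assignment below is an assumption for illustration only. The contrast it shows: a blocked multi-axis RoPE gives each axis (time, height, width) a contiguous slice of rotary frequency channels, while an interleaved layout spreads every axis across the full frequency spectrum:

```python
# Toy sketch only -- the real model's channel layout may differ.
AXES = ("time", "height", "width")

def blocked_assignment(num_channels: int) -> list:
    """Contiguous blocks: first third -> time, then height, then width."""
    block = num_channels // 3
    return [AXES[min(i // block, 2)] for i in range(num_channels)]

def interleaved_assignment(num_channels: int) -> list:
    """Round-robin: every axis touches low and high frequencies alike."""
    return [AXES[i % 3] for i in range(num_channels)]

print(blocked_assignment(6))      # ['time', 'time', 'height', 'height', 'width', 'width']
print(interleaved_assignment(6))  # ['time', 'height', 'width', 'time', 'height', 'width']
```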