coder3101/Qwen3-VL-32B-Instruct-Heretic
coder3101/Qwen3-VL-32B-Instruct-Heretic is a 33.4 billion parameter vision-language model, a decensored version of Qwen/Qwen3-VL-32B-Instruct. Developed by coder3101 using Heretic v1.0.1, this model offers enhanced visual perception, reasoning, and extended context length, with a significantly reduced refusal rate compared to its original counterpart. It is optimized for multimodal tasks including visual agent operations, visual coding, advanced spatial perception, and long context video understanding.
Loading preview...
Overview
This model, coder3101/Qwen3-VL-32B-Instruct-Heretic, is a 33.4 billion parameter vision-language model derived from the Qwen3-VL series. It has been decensored using the Heretic v1.0.1 tool, resulting in a substantial reduction in refusal rates (4/100) compared to the original Qwen/Qwen3-VL-32B-Instruct model (89/100).
Key Capabilities
- Enhanced Visual Perception & Reasoning: Offers deeper understanding of visual inputs, including spatial relationships and video dynamics.
- Extended Context Length: Supports a native 256K context, expandable to 1M, enabling processing of long documents and hours-long video with full recall.
- Visual Agent Operations: Capable of interacting with PC/mobile GUIs, recognizing elements, understanding functions, and completing tasks.
- Visual Coding Boost: Can generate Draw.io, HTML, CSS, and JS from images or videos.
- Advanced Spatial Perception: Judges object positions, viewpoints, and occlusions, supporting 2D and 3D grounding for embodied AI.
- Upgraded Visual Recognition & OCR: Recognizes a broad range of entities and supports OCR in 32 languages, robustly handling challenging conditions and complex document structures.
- Seamless Text-Vision Fusion: Achieves text understanding on par with pure LLMs through unified comprehension.
Architectural Innovations
- Interleaved-MRoPE: Utilizes full-frequency allocation over time, width, and height via robust positional embeddings for enhanced long-horizon video reasoning.
- DeepStack: Fuses multi-level ViT features to capture fine-grained details and improve image-text alignment.
- Text-Timestamp Alignment: Provides precise, timestamp-grounded event localization for stronger video temporal modeling.