Name: n0kovo/Qwen3-VL-32B-Instruct-heretic-v2 API
Brand: Featherless.ai
Price: 25.00 USD
Availability: InStock
Author: n0kovo

Model Overview

This model, n0kovo/Qwen3-VL-32B-Instruct-heretic-v2, is a 33.4 billion parameter vision-language model based on the Qwen3-VL-32B-Instruct architecture. It has been decensored using Heretic v1.1.0, significantly reducing refusals from 99/100 in the original model to 7/100, while maintaining a KL divergence of 0.1565 compared to the original. The model features a 32768 token context length and represents a comprehensive upgrade in the Qwen series, focusing on superior text understanding, deeper visual perception, and enhanced reasoning capabilities.

Key Capabilities

Visual Agent: Capable of operating PC/mobile GUIs, recognizing elements, understanding functions, and completing tasks.
Visual Coding Boost: Generates Draw.io, HTML, CSS, and JavaScript from image and video inputs.
Advanced Spatial Perception: Judges object positions, viewpoints, and occlusions, enabling 2D and 3D grounding for spatial reasoning.
Long Context & Video Understanding: Supports a native 256K context, expandable to 1M, for handling extensive documents and hours-long video with precise recall.
Enhanced Multimodal Reasoning: Excels in STEM/Math tasks, providing causal analysis and logical, evidence-based answers.
Upgraded Visual Recognition: Broad and high-quality pretraining allows recognition of a wide array of entities, including celebrities, products, and landmarks.
Expanded OCR: Supports 32 languages with robust performance in challenging conditions and improved long-document structure parsing.
Seamless Text-Vision Fusion: Achieves lossless, unified comprehension by integrating text and vision understanding on par with pure LLMs.

Architectural Innovations

Interleaved-MRoPE: Utilizes full-frequency allocation over time, width, and height via robust positional embeddings for enhanced long-horizon video reasoning.
DeepStack: Fuses multi-level ViT features to capture fine-grained details and sharpen image-text alignment.
Text-Timestamp Alignment: Employs precise, timestamp-grounded event localization for stronger video temporal modeling.

Good for

Applications requiring advanced visual understanding and reasoning.
Tasks involving GUI automation and visual coding generation.
Use cases demanding robust multimodal interaction with reduced content restrictions.

Overview

Model Overview

Key Capabilities

Architectural Innovations

Good for

Full Model Card (README)