unsloth/Qwen3-VL-32B-Instruct

VISIONConcurrency Cost:2Model Size:33.4BQuant:FP8Ctx Length:32kPublished:Oct 21, 2025License:apache-2.0Architecture:Transformer0.0K Open Weights Cold

unsloth/Qwen3-VL-32B-Instruct is a 33.4 billion parameter vision-language model from the Qwen series, developed by Qwen, offering comprehensive upgrades in text understanding, visual perception, and reasoning. It features an extended context length of 32768 tokens and enhanced spatial and video dynamics comprehension. This model excels in visual agent capabilities, visual coding, and multimodal reasoning, making it suitable for complex vision-language tasks.

Loading preview...

Qwen3-VL-32B-Instruct Overview

Qwen3-VL-32B-Instruct is a powerful 33.4 billion parameter vision-language model from the Qwen series, developed by Qwen, designed for advanced multimodal understanding and generation. It introduces significant enhancements across text and visual domains, including deeper visual perception, extended context handling, and improved reasoning capabilities.

Key Capabilities

  • Visual Agent: Interacts with PC/mobile GUIs, recognizing elements, understanding functions, and completing tasks.
  • Visual Coding Boost: Generates code (Draw.io/HTML/CSS/JS) directly from images and videos.
  • Advanced Spatial Perception: Accurately judges object positions, viewpoints, and occlusions, supporting 2D and 3D grounding for embodied AI.
  • Long Context & Video Understanding: Features a native 256K context, expandable to 1M, enabling full recall and second-level indexing for hours-long video content.
  • Enhanced Multimodal Reasoning: Excels in STEM/Math tasks, providing causal analysis and logical, evidence-based answers.
  • Upgraded Visual Recognition: Trained on broader, higher-quality data to recognize a vast array of entities, including celebrities, products, and landmarks.
  • Expanded OCR: Supports 32 languages with robust performance in challenging conditions and improved long-document structure parsing.
  • Seamless Text-Vision Fusion: Achieves text understanding on par with pure LLMs through unified comprehension.

Model Architecture Updates

Key architectural innovations include Interleaved-MRoPE for enhanced long-horizon video reasoning, DeepStack for fusing multi-level ViT features to sharpen image-text alignment, and Text-Timestamp Alignment for precise, timestamp-grounded event localization in video temporal modeling.