unsloth/Qwen3-VL-4B-Thinking

Vision · Concurrency cost: 1 · Model size: 4B · Quant: BF16 · Context length: 32k · Published: Oct 14, 2025 · License: apache-2.0 · Architecture: Transformer · Open weights

The unsloth/Qwen3-VL-4B-Thinking model is a 4-billion-parameter vision-language model in the Qwen series. It features comprehensive upgrades for superior text understanding and generation, deeper visual perception and reasoning, and an extended 32,768-token context length. This "Thinking" edition is specifically enhanced for reasoning, excelling at visual agent tasks, spatial perception, and multimodal reasoning on STEM/math problems.

Qwen3-VL-4B-Thinking: Enhanced Vision-Language Model

Qwen3-VL-4B-Thinking is a 4-billion-parameter vision-language model from the Qwen series, built with reasoning-focused enhancements. It delivers significant upgrades in text understanding, visual perception, and reasoning, and is served here with a 32,768-token context window; the underlying architecture natively supports a much longer context (see Key Capabilities below).
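
A minimal inference sketch follows, assuming a recent transformers release with Qwen3-VL support. The image URL, prompt, and generation settings are placeholders, and `AutoModelForImageTextToText` is used as the generic vision-language entry point rather than a model-specific class; treat this as an illustration, not the canonical usage.

```python
import torch
from transformers import AutoModelForImageTextToText, AutoProcessor

model_id = "unsloth/Qwen3-VL-4B-Thinking"

# Load in bfloat16 (matching the BF16 quant listed above) and let accelerate
# place weights on available devices.
model = AutoModelForImageTextToText.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
processor = AutoProcessor.from_pretrained(model_id)

# Placeholder image URL and prompt.
messages = [{
    "role": "user",
    "content": [
        {"type": "image", "url": "https://example.com/diagram.png"},
        {"type": "text", "text": "Explain what this diagram shows, step by step."},
    ],
}]

inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=512)

# Decode only the newly generated tokens (the "thinking" trace plus answer).
new_tokens = output_ids[:, inputs["input_ids"].shape[1]:]
print(processor.batch_decode(new_tokens, skip_special_tokens=True)[0])
```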

Key Capabilities

  • Visual Agent: Interacts with PC/mobile GUIs, recognizing elements, understanding functions, and completing tasks.
  • Visual Coding Boost: Generates Draw.io/HTML/CSS/JS from image and video inputs.
  • Advanced Spatial Perception: Judges object positions, viewpoints, and occlusions, enabling 2D and 3D grounding for spatial reasoning (see the prompt sketch after this list).
  • Long Context & Video Understanding: Features a native 256K context, expandable to 1M, capable of processing long documents and hours of video with precise indexing.
  • Enhanced Multimodal Reasoning: Excels in STEM/Math tasks, providing causal analysis and logical, evidence-based answers.
  • Upgraded Visual Recognition: Trained on broader, higher-quality data to recognize a wide array of entities including celebrities, products, and flora/fauna.
  • Expanded OCR: Supports 32 languages and remains robust under challenging conditions such as low light, blur, and tilt; also improves long-document structure parsing.
  • Text Understanding: Achieves text comprehension on par with pure LLMs through seamless text-vision fusion.
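
To make the spatial-perception and OCR items concrete, here is a hedged sketch of how such requests are typically phrased through the chat-template message schema used by recent transformers releases for Qwen VL models. The image URLs and prompt wording are placeholders, and the exact output format (e.g., box coordinate conventions) is defined by the model rather than guaranteed here.

```python
# Hypothetical request sketches for the capabilities above. The message schema
# is the standard transformers chat format for vision-language models; the
# URLs and prompts are placeholders.

grounding_request = [{
    "role": "user",
    "content": [
        {"type": "image", "url": "https://example.com/scene.jpg"},  # placeholder
        {"type": "text", "text": "Locate every person in the image and return 2D bounding boxes."},
    ],
}]

ocr_request = [{
    "role": "user",
    "content": [
        {"type": "image", "url": "https://example.com/receipt.jpg"},  # placeholder
        {"type": "text", "text": "Transcribe all text in this image, preserving line breaks."},
    ],
}]

# Either message list drops into the same apply_chat_template / generate flow
# shown in the loading sketch above.
print(grounding_request, ocr_request, sep="\n")
```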

Architectural Innovations

  • Interleaved-MRoPE: Allocates rotary-embedding frequencies across the time, width, and height axes in an interleaved pattern, providing robust positional encoding for long-horizon video reasoning (illustrated in the sketch below).
  • DeepStack: Fuses multi-level ViT features to capture fine-grained details and improve image-text alignment.
  • Text–Timestamp Alignment: Employs precise, timestamp-grounded event localization for stronger video temporal modeling.
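
The Interleaved-MRoPE idea is easier to see with a toy allocation. The sketch below is an illustrative reconstruction, not the model's actual implementation: classic multimodal RoPE assigns contiguous frequency blocks to the temporal/height/width axes, while an interleaved scheme cycles the axes across the spectrum so each axis sees both high and low frequencies.

```python
# Toy illustration (an assumption, not the actual Qwen3-VL code) of how rotary
# frequency pairs can be assigned to the temporal (t), height (h), and width
# (w) axes in a multimodal RoPE.

def chunked_axes(n_pairs: int) -> list[str]:
    """Block allocation: each axis gets a contiguous band of frequencies,
    so one axis monopolizes the lowest (longest-wavelength) frequencies."""
    third = n_pairs // 3
    return ["t"] * third + ["h"] * third + ["w"] * (n_pairs - 2 * third)

def interleaved_axes(n_pairs: int) -> list[str]:
    """Interleaved allocation: cycle t/h/w across the whole spectrum, so every
    axis covers the full frequency range -- the property that supports
    long-horizon video reasoning."""
    return ["thw"[i % 3] for i in range(n_pairs)]

if __name__ == "__main__":
    print(chunked_axes(12))      # ['t','t','t','t','h','h','h','h','w','w','w','w']
    print(interleaved_axes(12))  # ['t','h','w','t','h','w','t','h','w','t','h','w']
```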