unsloth/Qwen2.5-VL-32B-Instruct

VISION · Concurrency Cost: 2 · Model Size: 32B · Quant: FP8 · Ctx Length: 32k · Published: Mar 24, 2025 · License: apache-2.0 · Architecture: Transformer · Open Weights

unsloth/Qwen2.5-VL-32B-Instruct is a 32-billion-parameter instruction-tuned vision-language model from the Qwen2.5-VL family, developed by the Qwen team. It excels at multimodal understanding, including detailed image analysis, video comprehension, and visual localization with structured outputs, and is particularly strong at complex visual reasoning, mathematical problem-solving, and acting as a visual agent.


Qwen2.5-VL-32B-Instruct: Advanced Vision-Language Model

Qwen2.5-VL-32B-Instruct is a 32-billion-parameter instruction-tuned model from the Qwen2.5-VL series, developed by Qwen. It builds on its predecessors with reinforcement-learning-driven gains in mathematical and general problem-solving, responses better aligned with human preferences, and clearer formatting for objective queries.

Key Capabilities

  • Comprehensive Visual Understanding: Proficient in recognizing common objects, analyzing texts, charts, icons, graphics, and layouts within images.
  • Agentic Behavior: Functions as a visual agent capable of reasoning and dynamically directing tools for computer and phone use.
  • Long Video Comprehension: Understands videos over an hour long, with a new ability to capture events and pinpoint the relevant segments within a video.
  • Precise Visual Localization: Accurately localizes objects in images using bounding boxes or points, providing stable JSON outputs for coordinates and attributes.
  • Structured Output Generation: Supports structured outputs for data from invoices, forms, and tables, beneficial for financial and commercial applications.
  • Dynamic Resolution and Frame Rate Training: Utilizes dynamic FPS sampling and mRoPE (multimodal rotary position embedding) updates for enhanced temporal understanding in videos.
  • Optimized Vision Encoder: Features a streamlined ViT architecture with window attention, SwiGLU, and RMSNorm for improved training and inference speeds.
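To make the multimodal input concrete, here is a minimal sketch of the chat-message payload that Qwen2.5-VL processors typically consume. The field names (`role`, `content`, `type`, `image`, `text`) follow the published Qwen2.5-VL usage examples, but you should verify the schema against the processor version you actually load:

```python
# Sketch of a single-turn image-plus-text prompt in the message format
# used by Qwen2.5-VL chat templates. Field names follow the published
# Qwen2.5-VL usage examples; check them against your processor version.

def build_vl_message(image_ref: str, question: str) -> list[dict]:
    """Pair one image reference (path or URL) with a text query."""
    return [
        {
            "role": "user",
            "content": [
                {"type": "image", "image": image_ref},
                {"type": "text", "text": question},
            ],
        }
    ]

# Example: ask for structured data extraction from a document image.
messages = build_vl_message("invoice.png", "Extract the total amount as JSON.")
```

A list like this is normally passed to the processor's chat-template method, which renders it into model-ready tokens alongside the pixel inputs.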

Good For

  • Applications requiring detailed image and video analysis.
  • Tasks involving visual reasoning, such as mathematical problems and logical reasoning from visual inputs.
  • Use cases needing structured data extraction from visual documents like invoices or forms.
  • Developing visual agents for computer or phone interaction.
  • Scenarios demanding precise object localization and attribute extraction from images.
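Because the model emits localization results as JSON bounding boxes, downstream code usually validates and clips those coordinates before cropping or drawing. A minimal sketch, assuming a detection format of `[{"bbox_2d": [x1, y1, x2, y2], "label": "..."}]` (these key names follow Qwen2.5-VL grounding examples, but the exact keys depend on your prompt):

```python
import json


def parse_boxes(model_output: str, width: int, height: int) -> list[dict]:
    """Parse a JSON list of detections and clip each box to image bounds.

    Assumes each detection looks like
    {"bbox_2d": [x1, y1, x2, y2], "label": "..."} -- key names follow
    Qwen2.5-VL grounding examples and should be checked against the
    actual prompts you use.
    """
    detections = json.loads(model_output)
    cleaned = []
    for det in detections:
        x1, y1, x2, y2 = det["bbox_2d"]
        # Clamp coordinates to the image and re-order so downstream
        # crops never receive out-of-range or inverted boxes.
        x1, x2 = sorted((max(0, min(x1, width)), max(0, min(x2, width))))
        y1, y2 = sorted((max(0, min(y1, height)), max(0, min(y2, height))))
        cleaned.append({"label": det.get("label", ""), "bbox_2d": [x1, y1, x2, y2]})
    return cleaned


# Example: a box that slightly overshoots a 600x400 image gets clipped.
raw = '[{"bbox_2d": [34, 20, 610, 450], "label": "dog"}]'
boxes = parse_boxes(raw, width=600, height=400)
```

Clipping at the parsing boundary keeps occasional out-of-range coordinates from propagating into cropping or annotation code.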