Qwen/Qwen2.5-VL-32B-Instruct

VISION · Concurrency Cost: 2 · Model Size: 32B · Quant: FP8 · Ctx Length: 32k · Published: Mar 21, 2025 · License: apache-2.0 · Architecture: Transformer · 0.5K · Open Weights · Cold

Qwen2.5-VL-32B-Instruct is a 32 billion parameter vision-language model developed by Qwen, enhanced for mathematical and problem-solving abilities through reinforcement learning. It excels at visual understanding, including analyzing texts, charts, and layouts within images, and can act as a visual agent for computer and phone use. The model also features advanced video comprehension, capable of understanding long videos and pinpointing specific events, and supports visual localization with structured JSON outputs for coordinates and attributes.


Overview

Qwen2.5-VL-32B-Instruct is a 32 billion parameter vision-language model from the Qwen family, building upon the Qwen2-VL architecture. It has been significantly enhanced with reinforcement learning, improving mathematical and problem-solving capabilities as well as response style and alignment with human preferences.
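Since the model is served with an open-weights deployment, a typical way to query it is through an OpenAI-compatible chat endpoint. The sketch below only builds the request payload for one image plus one question; the exact base URL, routing, and parameter support are provider-specific assumptions.

```python
# Sketch of an OpenAI-style multimodal chat payload for
# Qwen/Qwen2.5-VL-32B-Instruct. Endpoint details are assumptions;
# adapt to your provider before sending.
import base64
import json


def build_vision_request(image_bytes: bytes, question: str) -> dict:
    """Build a chat payload with one image part and one text part."""
    b64 = base64.b64encode(image_bytes).decode("ascii")
    return {
        "model": "Qwen/Qwen2.5-VL-32B-Instruct",
        "messages": [
            {
                "role": "user",
                "content": [
                    # Image is inlined as a base64 data URL.
                    {"type": "image_url",
                     "image_url": {"url": f"data:image/png;base64,{b64}"}},
                    {"type": "text", "text": question},
                ],
            }
        ],
        "max_tokens": 512,
    }


# Placeholder bytes stand in for a real PNG file.
payload = build_vision_request(b"\x89PNG...", "What does this chart show?")
print(json.dumps(payload)[:60])
```

The payload can then be POSTed to the provider's `/chat/completions` route with any HTTP client.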

Key Capabilities

  • Advanced Visual Understanding: Proficient in recognizing common objects and analyzing complex visual elements like texts, charts, icons, graphics, and layouts within images.
  • Agentic Behavior: Functions as a visual agent capable of reasoning and dynamically directing tools for computer and phone use.
  • Long Video Comprehension: Can understand videos exceeding one hour, with a new ability to capture and pinpoint relevant events within video segments.
  • Visual Localization: Accurately localizes objects in images by generating bounding boxes or points, providing stable JSON outputs for coordinates and attributes.
  • Structured Output Generation: Supports structured outputs for data from invoices, forms, and tables, beneficial for finance and commerce applications.
  • Dynamic Resolution and Frame Rate Training: Incorporates dynamic FPS sampling and mRoPE updates in the time dimension for enhanced video understanding and event pinpointing.
  • Efficient Vision Encoder: Features a streamlined ViT architecture with window attention, SwiGLU, and RMSNorm for improved training and inference speeds.
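The localization and structured-output capabilities above return JSON that downstream code must parse. A minimal sketch of that step is below; the `bbox_2d`/`label` field names follow the shape Qwen2.5-VL grounding replies are commonly reported to use, so treat them as an assumption and adjust to what your deployment actually returns.

```python
# Minimal sketch of parsing a JSON localization reply.
# Field names ("bbox_2d", "label") are assumed, not guaranteed.
import json


def parse_detections(reply: str) -> list[dict]:
    """Extract labeled bounding boxes, tolerating a ```json code fence."""
    text = reply.strip()
    if text.startswith("```"):
        # Drop the opening ```json line and the closing ``` line.
        text = "\n".join(text.splitlines()[1:-1])
    items = json.loads(text)
    return [
        {"label": item["label"], "box": tuple(item["bbox_2d"])}
        for item in items
    ]


reply = '```json\n[{"bbox_2d": [10, 20, 110, 220], "label": "invoice total"}]\n```'
print(parse_detections(reply))
# → [{'label': 'invoice total', 'box': (10, 20, 110, 220)}]
```

Validating the parsed coordinates against the image dimensions before use is a sensible extra step, since model outputs are not guaranteed to be well-formed.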

Performance Highlights

Evaluations show strong performance across vision and text benchmarks. On vision tasks, Qwen2.5-VL-32B posts competitive scores on MMMU, MathVista, OCRBenchV2, and video-understanding datasets such as LVBench and Charades-STA. On text benchmarks, it achieves notable results on MMLU, MATH, and HumanEval, matching or outperforming models such as Mistral-Small-3.1-24B and Gemma3-27B-IT in several categories.

Good For

  • Applications requiring detailed visual analysis and understanding of complex image content.
  • Developing visual agents for automated computer or phone interactions.
  • Processing and extracting structured data from visual documents like invoices and forms.
  • Analyzing and summarizing long video content, including event detection.
  • Tasks benefiting from precise visual localization and structured output generation.