Qwen/Qwen2.5-VL-72B-Instruct

VISIONConcurrency Cost:4Model Size:72BQuant:FP8Ctx Length:32kPublished:Jan 27, 2025License:qwenArchitecture:Transformer0.6K Cold

Qwen/Qwen2.5-VL-72B-Instruct is a 72 billion parameter instruction-tuned vision-language model developed by Qwen, designed for advanced multimodal understanding. It excels at analyzing images, charts, and layouts, comprehending long videos, and performing visual localization with structured outputs. This model is particularly strong in visual agency, enabling it to reason and dynamically direct tools for computer and phone use.

Loading preview...

Overview

Qwen2.5-VL-72B-Instruct is the latest 72 billion parameter instruction-tuned vision-language model from the Qwen family, building upon feedback from its predecessor, Qwen2-VL. This model introduces significant enhancements in visual and video understanding, making it a powerful tool for complex multimodal tasks.

Key Capabilities

  • Advanced Visual Understanding: Proficient in recognizing common objects and analyzing intricate visual elements like texts, charts, icons, graphics, and layouts within images.
  • Visual Agency: Functions as a visual agent capable of reasoning and dynamically directing tools for computer and phone interactions.
  • Long Video Comprehension: Can understand videos exceeding one hour in duration, with a new ability to pinpoint relevant events within video segments.
  • Precise Visual Localization: Accurately localizes objects in images using bounding boxes or points, providing stable JSON outputs for coordinates and attributes.
  • Structured Output Generation: Supports structured outputs for data from invoices, forms, and tables, beneficial for financial and commercial applications.

Model Architecture Updates

  • Dynamic Resolution and Frame Rate Training: Extends dynamic resolution to the temporal dimension using dynamic FPS sampling, enhancing video comprehension across various sampling rates. This includes updated mRoPE for temporal sequence and speed learning.
  • Efficient Vision Encoder: Improves training and inference speeds through window attention in the ViT, further optimized with SwiGLU and RMSNorm to align with the Qwen2.5 LLM structure.

Performance

Evaluations show Qwen2.5-VL-72B-Instruct achieving competitive or leading scores across various image, video, and agent benchmarks, including MMMU, MathVista, DocVQA, VideoMME, and ScreenSpot, often outperforming previous Qwen-VL versions and other leading models like GPT4o and Claude3.5 Sonnet in specific tasks.