Qwen/Qwen2.5-VL-32B-Instruct

VISION · Concurrency Cost: 2 · Model Size: 32B · Quant: FP8 · Ctx Length: 32k · Published: Mar 21, 2025 · License: apache-2.0 · Architecture: Transformer · 0.5K · Open Weights · Cold

Qwen2.5-VL-32B-Instruct is a 32 billion parameter vision-language model developed by Qwen, enhanced for mathematical and problem-solving abilities through reinforcement learning. It excels at visual understanding, including analyzing texts, charts, and layouts within images, and can act as a visual agent for computer and phone use. The model also features advanced video comprehension, capable of understanding long videos and pinpointing specific events, and supports visual localization with structured JSON outputs for coordinates and attributes.


Overview

Qwen2.5-VL-32B-Instruct is a 32 billion parameter vision-language model from the Qwen family, building upon the Qwen2-VL architecture. It has been significantly enhanced with reinforcement learning, improving mathematical and problem-solving capabilities as well as response style and alignment with human preferences.
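Since the model is served with an open-weights deployment, a typical way to query it is through an OpenAI-compatible chat endpoint. The sketch below only builds the request payload for one image plus one question; the exact base URL, routing, and parameter support are provider-specific assumptions.

```python
# Sketch of an OpenAI-style multimodal chat payload for
# Qwen/Qwen2.5-VL-32B-Instruct. Endpoint details are assumptions;
# adapt to your provider before sending.
import base64
import json


def build_vision_request(image_bytes: bytes, question: str) -> dict:
    """Build a chat payload with one image part and one text part."""
    b64 = base64.b64encode(image_bytes).decode("ascii")
    return {
        "model": "Qwen/Qwen2.5-VL-32B-Instruct",
        "messages": [
            {
                "role": "user",
                "content": [
                    # Image is inlined as a base64 data URL.
                    {"type": "image_url",
                     "image_url": {"url": f"data:image/png;base64,{b64}"}},
                    {"type": "text", "text": question},
                ],
            }
        ],
        "max_tokens": 512,
    }


# Placeholder bytes stand in for a real PNG file.
payload = build_vision_request(b"\x89PNG...", "What does this chart show?")
print(json.dumps(payload)[:60])
```

The payload can then be POSTed to the provider's `/chat/completions` route with any HTTP client.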

Key Capabilities

  • Advanced Visual Understanding: Proficient in recognizing common objects and analyzing complex visual elements like texts, charts, icons, graphics, and layouts within images.
  • Agentic Behavior: Functions as a visual agent capable of reasoning and dynamically directing tools for computer and phone use.
  • Long Video Comprehension: Can understand videos exceeding one hour, with a new ability to capture and pinpoint relevant events within video segments.
  • Visual Localization: Accurately localizes objects in images by generating bounding boxes or points, providing stable JSON outputs for coordinates and attributes.
  • Structured Output Generation: Supports structured outputs for data from invoices, forms, and tables, beneficial for finance and commerce applications.
  • Dynamic Resolution and Frame Rate Training: Incorporates dynamic FPS sampling and mRoPE updates in the time dimension for enhanced video understanding and event pinpointing.
  • Efficient Vision Encoder: Features a streamlined ViT architecture with window attention, SwiGLU, and RMSNorm for improved training and inference speeds.
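The localization and structured-output capabilities above return JSON that downstream code must parse. A minimal sketch of that step is below; the `bbox_2d`/`label` field names follow the shape Qwen2.5-VL grounding replies are commonly reported to use, so treat them as an assumption and adjust to what your deployment actually returns.

```python
# Minimal sketch of parsing a JSON localization reply.
# Field names ("bbox_2d", "label") are assumed, not guaranteed.
import json


def parse_detections(reply: str) -> list[dict]:
    """Extract labeled bounding boxes, tolerating a ```json code fence."""
    text = reply.strip()
    if text.startswith("```"):
        # Drop the opening ```json line and the closing ``` line.
        text = "\n".join(text.splitlines()[1:-1])
    items = json.loads(text)
    return [
        {"label": item["label"], "box": tuple(item["bbox_2d"])}
        for item in items
    ]


reply = '```json\n[{"bbox_2d": [10, 20, 110, 220], "label": "invoice total"}]\n```'
print(parse_detections(reply))
# → [{'label': 'invoice total', 'box': (10, 20, 110, 220)}]
```

Validating the parsed coordinates against the image dimensions before use is a sensible extra step, since model outputs are not guaranteed to be well-formed.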

Performance Highlights

Evaluations show strong performance across vision and text benchmarks. On vision tasks, Qwen2.5-VL-32B posts competitive scores on MMMU, MathVista, OCRBenchV2, and video-understanding datasets such as LVBench and Charades-STA. On text benchmarks, it achieves notable results on MMLU, MATH, and HumanEval, matching or outperforming models such as Mistral-Small-3.1-24B and Gemma3-27B-IT in several categories.

Good For

  • Applications requiring detailed visual analysis and understanding of complex image content.
  • Developing visual agents for automated computer or phone interactions.
  • Processing and extracting structured data from visual documents like invoices and forms.
  • Analyzing and summarizing long video content, including event detection.
  • Tasks benefiting from precise visual localization and structured output generation.