Alibaba-NLP/WebWatcher-32B

Vision · Concurrency Cost: 2 · Model Size: 32B · Quant: FP8 · Ctx Length: 32k · Published: Aug 27, 2025 · Architecture: Transformer

WebWatcher-32B by Alibaba-NLP is a 32 billion parameter multimodal agent designed for deep research, featuring enhanced visual-language reasoning and multi-tool interaction within a unified framework. It excels at complex reasoning, information retrieval, and knowledge-retrieval integration, significantly outperforming proprietary baselines like GPT-4o on benchmarks such as HLE-VL, BrowseComp-VL, LiveVQA, and MMSearch. Its primary use case is advanced visual search and in-depth multimodal reasoning tasks requiring strategic planning and tool use.


WebWatcher: A Multimodal Deep Research Agent

WebWatcher, developed by Alibaba-NLP, is a 32 billion parameter multimodal agent specifically engineered for deep research, integrating advanced visual-language reasoning with multi-tool interaction. This model introduces a unified framework to tackle complex information gathering and analysis tasks.

Key Capabilities

  • Enhanced Visual-Language Reasoning: Combines visual perception with advanced language understanding for in-depth analysis.
  • Multi-Tool Interaction: Equipped with tools such as Web Image Search, Web Text Search, Webpage Visit, Code Interpreter, and an internal OCR tool for comprehensive information gathering.
  • Automated Trajectory Generation: Utilizes an automated pipeline to create high-quality, multi-step reasoning trajectories for robust tool-use capabilities and efficient training.
  • Superior Performance: Significantly outperforms proprietary models such as GPT-4o, Gemini 2.5 Flash, and the open-weight Qwen2.5-VL-72B across challenging VQA benchmarks.
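The multi-tool interaction described above can be sketched as a minimal dispatch loop. Everything below is an illustrative assumption, not WebWatcher's actual implementation: the tool names mirror the list above, but each is a stub returning a canned observation, and the loop runs a scripted action list rather than model-generated calls.

```python
# Illustrative sketch of an agent tool-dispatch loop; all tools are stubs.

def web_text_search(query: str) -> str:
    return f"[stub] text results for: {query}"

def web_image_search(query: str) -> str:
    return f"[stub] image results for: {query}"

def visit_webpage(url: str) -> str:
    return f"[stub] page content of: {url}"

def code_interpreter(code: str) -> str:
    return f"[stub] executed: {code}"

def ocr(image_ref: str) -> str:
    return f"[stub] text extracted from: {image_ref}"

TOOLS = {
    "web_text_search": web_text_search,
    "web_image_search": web_image_search,
    "visit_webpage": visit_webpage,
    "code_interpreter": code_interpreter,
    "ocr": ocr,
}

def run_agent(actions):
    """Execute a scripted list of (tool_name, argument) steps and collect
    observations, as a model-driven loop would between reasoning turns."""
    trajectory = []
    for tool_name, arg in actions:
        observation = TOOLS[tool_name](arg)
        trajectory.append((tool_name, observation))
    return trajectory

trace = run_agent([
    ("web_image_search", "landmark in uploaded photo"),
    ("visit_webpage", "https://example.com/article"),
])
for step in trace:
    print(step)
```

In a real agent, the action list would be produced incrementally by the model, with each observation fed back into its context before the next reasoning turn.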

Performance Highlights

WebWatcher-32B demonstrates leading performance on several benchmarks:

  • Complex Reasoning (HLE-VL): Achieved a Pass@1 score of 13.6%, surpassing GPT-4o (9.8%).
  • Information Retrieval (MMSearch): Scored 55.3% Pass@1, outperforming Gemini 2.5 Flash (43.9%) and GPT-4o (24.1%).
  • Knowledge-Retrieval Integration (LiveVQA): Achieved 58.7% Pass@1, exceeding Gemini 2.5 Flash (41.3%) and GPT-4o (34.0%).
  • Information Optimization and Aggregation (BrowseComp-VL): Led with an average score of 27.0%, more than double GPT-4o (13.4%) and Gemini 2.5 Flash (13.0%).
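Pass@1, used throughout these results, is simply the fraction of benchmark tasks solved correctly on the first attempt. A minimal sketch of the computation (the outcomes below are made up for illustration, not real benchmark data):

```python
def pass_at_1(results):
    """Pass@1: fraction of tasks whose first attempt was judged correct.
    `results` maps task id -> bool (first-attempt correctness)."""
    if not results:
        return 0.0
    return sum(results.values()) / len(results)

# Hypothetical first-attempt outcomes for four tasks.
outcomes = {"q1": True, "q2": False, "q3": True, "q4": True}
print(f"{pass_at_1(outcomes):.1%}")  # → 75.0%
```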

Good for

  • Deep Research Tasks: Ideal for scenarios requiring extensive information gathering and complex reasoning across visual and textual modalities.
  • Advanced Visual Search: Excels in real-world visual search benchmarks, providing precise retrieval and robust information aggregation.
  • Multimodal Agent Development: Serves as a strong baseline for developing and evaluating multimodal agents that require strategic planning and tool use.