Alibaba-NLP/WebWatcher-32B
WebWatcher-32B by Alibaba-NLP is a 32-billion-parameter multimodal agent designed for deep research, featuring enhanced visual-language reasoning and multi-tool interaction within a unified framework. It excels at complex reasoning, information retrieval, and knowledge-retrieval integration, significantly outperforming proprietary baselines such as GPT-4o on benchmarks including HLE-VL, BrowseComp-VL, LiveVQA, and MMSearch. Its primary use cases are advanced visual search and in-depth multimodal reasoning tasks that require strategic planning and tool use.
WebWatcher: A Multimodal Deep Research Agent
WebWatcher, developed by Alibaba-NLP, is a 32-billion-parameter multimodal agent engineered specifically for deep research, integrating advanced visual-language reasoning with multi-tool interaction. The model introduces a unified framework for tackling complex information-gathering and analysis tasks.
Key Capabilities
- Enhanced Visual-Language Reasoning: Combines visual perception with advanced language understanding for in-depth analysis.
- Multi-Tool Interaction: Equipped with tools such as Web Image Search, Web Text Search, Webpage Visit, Code Interpreter, and an internal OCR tool for comprehensive information gathering.
- Automated Trajectory Generation: Utilizes an automated pipeline to create high-quality, multi-step reasoning trajectories for robust tool-use capabilities and efficient training.
- Superior Performance: Significantly outperforms proprietary models such as GPT-4o and Gemini-2.5-Flash, as well as the open-source Qwen2.5-VL-72B, across challenging VQA benchmarks.
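To make the multi-tool interaction concrete, the core of such an agent is a loop that routes model-emitted tool calls to the right backend. The sketch below is a minimal, hypothetical dispatcher: the tool names mirror the capability list above, but the actual WebWatcher tool interfaces are defined by its own framework and are not documented here.

```python
from dataclasses import dataclass
from typing import Callable, Dict

@dataclass
class ToolCall:
    """A single tool invocation emitted by the agent (hypothetical schema)."""
    name: str
    argument: str

class ToolRegistry:
    """Maps tool names to callables and dispatches incoming calls."""

    def __init__(self) -> None:
        self._tools: Dict[str, Callable[[str], str]] = {}

    def register(self, name: str, fn: Callable[[str], str]) -> None:
        self._tools[name] = fn

    def dispatch(self, call: ToolCall) -> str:
        # Unknown tools return an error string instead of raising,
        # so the agent can recover and re-plan.
        if call.name not in self._tools:
            return f"unknown tool: {call.name}"
        return self._tools[call.name](call.argument)

# Stub tools standing in for the real Web Text Search / Code Interpreter.
registry = ToolRegistry()
registry.register("web_text_search", lambda q: f"results for '{q}'")
registry.register("code_interpreter", lambda src: str(eval(src)))  # sketch only; eval is unsafe in production

print(registry.dispatch(ToolCall("code_interpreter", "2 + 3")))  # → 5
```

In a real agent loop, the model's output would be parsed into `ToolCall` objects, each observation appended to the reasoning trajectory, and the loop repeated until the agent emits a final answer.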
Performance Highlights
WebWatcher-32B demonstrates leading performance on several benchmarks:
- Complex Reasoning (HLE-VL): Achieved a Pass@1 score of 13.6%, surpassing GPT-4o (9.8%).
- Information Retrieval (MMSearch): Scored 55.3% Pass@1, outperforming Gemini-2.5-Flash (43.9%) and GPT-4o (24.1%).
- Knowledge-Retrieval Integration (LiveVQA): Achieved 58.7% Pass@1, exceeding Gemini-2.5-Flash (41.3%) and GPT-4o (34.0%).
- Information Optimization and Aggregation (BrowseComp-VL): Led with an average score of 27.0%, more than doubling GPT-4o (13.4%) and Gemini-2.5-Flash (13.0%).
Good for
- Deep Research Tasks: Ideal for scenarios requiring extensive information gathering and complex reasoning across visual and textual modalities.
- Advanced Visual Search: Excels in real-world visual search benchmarks, providing precise retrieval and robust information aggregation.
- Multimodal Agent Development: Serves as a strong baseline for developing and evaluating multimodal agents that require strategic planning and tool use.