OpenSearch-VL/OpenSearch-VL-8B
OpenSearch-VL/OpenSearch-VL-8B is an 8-billion-parameter multimodal deep-research agent developed by the OpenSearch-VL project and built on Qwen3-VL. It is designed for complex, knowledge-intensive visual question answering and operates as a closed-loop agent that inspects images, performs web and image searches, and uses visual tools to ground its answers in gathered evidence. It is trained for multi-turn reasoning and tool use, and outperforms agentic baselines on seven knowledge-intensive multimodal benchmarks.
OpenSearch-VL/OpenSearch-VL-8B: Frontier Multimodal Search Agent
OpenSearch-VL is an open-source recipe for training advanced multimodal deep-research agents. Its 8-billion-parameter model, OpenSearch-VL-8B, is built on the Qwen3-VL architecture. Unlike standard VLMs that answer in a single pass, this model functions as a closed-loop agent: it can inspect images, crop regions of interest, issue web and image searches, and visit retrieved pages to ground its responses in evidence.
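To make the closed-loop idea concrete, here is a minimal sketch of an agent loop that alternates between choosing a tool, observing its result, and eventually answering. It is not the project's actual inference code: `run_agent`, `scripted_policy`, the action dictionary format, and the stub tools are all hypothetical, and a real deployment would replace the scripted policy with calls to the model.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class Step:
    """One tool call and what it returned."""
    tool: str
    argument: str
    observation: str


def run_agent(
    question: str,
    policy: Callable[[str, List[Step]], dict],
    tools: Dict[str, Callable[[str], str]],
    max_turns: int = 8,
) -> Tuple[Optional[str], List[Step]]:
    """Loop until the policy answers or the turn budget runs out."""
    history: List[Step] = []
    for _ in range(max_turns):
        action = policy(question, history)
        if action["type"] == "answer":
            return action["content"], history
        tool = tools.get(action["tool"])
        if tool is None:
            # Report the failure as an observation so the policy can recover.
            observation = f"error: unknown tool {action['tool']!r}"
        else:
            observation = tool(action["argument"])
        history.append(Step(action["tool"], action["argument"], observation))
    return None, history


def scripted_policy(question: str, history: List[Step]) -> dict:
    """Hypothetical stand-in for the model: crop, search, then answer."""
    plan = [("crop", "top-left"), ("text_search", "landmark in crop")]
    if len(history) < len(plan):
        tool, argument = plan[len(history)]
        return {"type": "tool", "tool": tool, "argument": argument}
    evidence = history[-1].observation
    return {"type": "answer", "content": f"answer based on: {evidence}"}


# Stub tools; the names mirror two tools from the card, the bodies are placeholders.
TOOLS = {
    "crop": lambda region: f"cropped region {region}",
    "text_search": lambda query: f"stub results for '{query}'",
}

answer, trace = run_agent("What is this building?", scripted_policy, TOOLS)
print(answer)
print([step.tool for step in trace])
```

Feeding every observation back into the policy, including error strings from failed tool calls, is what distinguishes this loop from a single-pass VLM call.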
Key Capabilities
- Agentic Reinforcement Learning: Utilizes a multi-turn fatal-aware GRPO algorithm to handle cascading tool failures during long rollouts, preserving valid reasoning.
- Comprehensive Tool Use: Equipped with a unified visual and retrieval tool environment including crop, layout_parsing, text_search, image_search, web_search, visit, sharpen, super_resolution, perspective_correct, and python_interpreter.
- Data Curation Pipeline: Features a novel pipeline built on the Wikipedia hyperlink graph to synthesize image-grounded multi-hop VQA datasets (SearchVL-SFT-36k and SearchVL-RL-8k), designed to suppress shortcut solutions.
- Enhanced Performance: Improves average scores by over 10 points on seven knowledge-intensive multimodal benchmarks (SimpleVQA, VDR, MMSearch, LiveVQA, BrowseComp-VL, FVQA, InfoSeek) compared to agentic baselines.
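The card does not spell out how the fatal-aware GRPO variant works, so the following is only one plausible reading, not the project's algorithm. It computes the standard GRPO group-normalized advantage (each rollout's reward minus the group mean, divided by the group standard deviation) and then, as an assumption, zeroes the advantage for every turn from the first fatal tool failure onward, so that earlier valid turns still receive a learning signal. The function names and the per-turn boolean flags are hypothetical.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Standard GRPO-style advantage: normalize rewards within one group of rollouts."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def turn_mask(turn_fatal: List[bool]) -> List[float]:
    """Assumed masking rule: 1.0 before the first fatal failure, 0.0 from that turn on."""
    mask = []
    alive = True
    for fatal in turn_fatal:
        if fatal:
            alive = False
        mask.append(1.0 if alive else 0.0)
    return mask


def per_turn_advantages(
    rewards: List[float], fatal_flags: List[List[bool]]
) -> List[List[float]]:
    """Broadcast each rollout's advantage over its turns, then apply the mask."""
    advantages = group_relative_advantages(rewards)
    return [
        [adv * m for m in turn_mask(flags)]
        for adv, flags in zip(advantages, fatal_flags)
    ]


# Toy group of four rollouts with three turns each; rollout 1 fails fatally at turn 1.
rewards = [1.0, 0.0, 0.0, 1.0]
fatal_flags = [
    [False, False, False],
    [False, True, False],
    [False, False, False],
    [False, False, False],
]
result = per_turn_advantages(rewards, fatal_flags)
for row in result:
    print([round(a, 3) for a in row])
```

In this toy group, the failed rollout keeps its negative advantage on the turn before the failure and contributes nothing for the failing turn and everything after it.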
Good For
- Complex Visual Question Answering: Ideal for questions requiring multi-step reasoning, external knowledge acquisition, and visual evidence analysis.
- Multimodal Search and Retrieval: Excels in scenarios where information needs to be gathered from both visual and textual sources through iterative search.
- Research and Development: Provides a fully open recipe (data, code, and model checkpoints) for reproducing, fine-tuning, and evaluating advanced multimodal agents.