OpenSearch-VL/OpenSearch-VL-8B

VISION | Concurrency Cost: 1 | Model Size: 8B | Quant: FP8 | Ctx Length: 32k | Published: Mar 5, 2026 | Architecture: Transformer

OpenSearch-VL/OpenSearch-VL-8B is an 8-billion-parameter multimodal deep-research agent developed by the OpenSearch-VL project and built on Qwen3-VL. It is designed for complex, knowledge-intensive visual question answering and operates as a closed-loop agent: it inspects images, performs web and image searches, and uses visual tools to ground its answers in the evidence it gathers. It is trained for multi-turn reasoning and tool use, and improves average scores by over 10 points over agentic baselines on seven knowledge-intensive multimodal benchmarks.


OpenSearch-VL/OpenSearch-VL-8B: Frontier Multimodal Search Agent

OpenSearch-VL is an open-source recipe for training advanced multimodal deep-research agents. Its 8-billion-parameter model, OpenSearch-VL-8B, is built on the Qwen3-VL architecture. Unlike standard VLMs that answer in a single pass, this model functions as a closed-loop agent: it can inspect images, crop regions of interest, issue web and image searches, and visit retrieved pages to ground its responses in evidence.
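
The closed-loop behaviour described above can be illustrated with a minimal sketch. It is not the project's actual runtime: the `Step`, `AgentState`, and `run_agent` names, the scripted policy, and the stub tools are all hypothetical, and only the tool names `crop`, `image_search`, and `visit` are taken from this page. The point is the control flow: the policy picks a tool, the observation is appended to the evidence, and the loop repeats until the policy decides to answer or the turn budget runs out.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Step:
    tool: str      # tool name chosen by the policy, or "answer" to stop
    argument: str  # tool input, or the final answer text


@dataclass
class AgentState:
    question: str
    evidence: List[str] = field(default_factory=list)  # observations so far


def run_agent(
    question: str,
    policy: Callable[[AgentState], Step],
    tools: Dict[str, Callable[[str], str]],
    max_turns: int = 8,
) -> str:
    """Alternate between policy decisions and tool calls until an answer."""
    state = AgentState(question)
    for _ in range(max_turns):
        step = policy(state)
        if step.tool == "answer":
            return step.argument
        tool = tools.get(step.tool)
        if tool is None:
            # Feed the failure back as an observation instead of crashing.
            state.evidence.append(f"[error] unknown tool: {step.tool}")
            continue
        state.evidence.append(tool(step.argument))
    return "no answer within turn budget"


# Hypothetical stand-ins: a real agent would call the model and real tools.
STUB_TOOLS = {
    "crop": lambda arg: f"cropped: {arg}",
    "image_search": lambda arg: f"image results for: {arg}",
    "visit": lambda arg: f"page text from: {arg}",
}


def scripted_policy(state: AgentState) -> Step:
    """Fixed plan standing in for the model: crop, search, visit, answer."""
    plan = [
        Step("crop", "upper-left region"),
        Step("image_search", "cropped landmark"),
        Step("visit", "https://example.org/landmark"),
    ]
    if len(state.evidence) < len(plan):
        return plan[len(state.evidence)]
    return Step("answer", f"answer grounded in {len(state.evidence)} observations")
```

Calling `run_agent("Which building is this?", scripted_policy, STUB_TOOLS)` walks through the three tool calls and then returns the final answer string. Returning tool errors as observations, rather than raising, is one simple way to let a policy recover from a bad call.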

Key Capabilities

  • Agentic Reinforcement Learning: Trained with a multi-turn, fatal-aware GRPO algorithm that handles cascading tool failures during long rollouts while preserving valid reasoning.
  • Comprehensive Tool Use: Equipped with a unified visual and retrieval tool environment including crop, layout_parsing, text_search, image_search, web_search, visit, sharpen, super_resolution, perspective_correct, and python_interpreter.
  • Data Curation Pipeline: Features a novel pipeline built on the Wikipedia hyperlink graph to synthesize image-grounded multi-hop VQA datasets (SearchVL-SFT-36k and SearchVL-RL-8k), designed to suppress shortcut solutions.
  • Enhanced Performance: Improves average scores by over 10 points on seven knowledge-intensive multimodal benchmarks (SimpleVQA, VDR, MMSearch, LiveVQA, BrowseComp-VL, FVQA, InfoSeek) compared to agentic baselines.
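
To give a rough feel for the reinforcement-learning bullet above, the sketch below shows standard GRPO-style group-relative advantages combined with a per-turn mask. The masking rule is an assumption made purely for illustration: this page does not specify how the fatal-aware variant works, so the idea here (turns from the first fatal tool failure onward get zero weight, while earlier turns keep the rollout's advantage) should not be read as the project's algorithm. All function names are hypothetical.

```python
from statistics import mean, pstdev
from typing import List, Optional


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """GRPO-style advantage: normalise each reward within its rollout group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the group
    return [(r - mu) / (sigma + eps) for r in rewards]


def turn_mask(num_turns: int, first_fatal_turn: Optional[int]) -> List[float]:
    """ASSUMPTION: zero out turns at and after the first fatal tool failure."""
    if first_fatal_turn is None:
        return [1.0] * num_turns
    return [1.0 if t < first_fatal_turn else 0.0 for t in range(num_turns)]


def per_turn_advantages(
    rewards: List[float],
    num_turns: List[int],
    fatal_turns: List[Optional[int]],
) -> List[List[float]]:
    """Broadcast each rollout's advantage over its turns, then apply the mask."""
    advantages = group_relative_advantages(rewards)
    return [
        [a * m for m in turn_mask(n, f)]
        for a, n, f in zip(advantages, num_turns, fatal_turns)
    ]
```

For example, `per_turn_advantages([1.0, 0.0], [3, 4], [None, 2])` gives the first rollout a positive advantage on all three turns, while the second rollout keeps its negative advantage on turns 0 and 1 only; turns 2 and 3, which follow the fatal failure, contribute nothing to the update.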

Good For

  • Complex Visual Question Answering: Ideal for questions requiring multi-step reasoning, external knowledge acquisition, and visual evidence analysis.
  • Multimodal Search and Retrieval: Excels in scenarios where information needs to be gathered from both visual and textual sources through iterative search.
  • Research and Development: Provides a fully open recipe (data, code, and model checkpoints) for reproducing, fine-tuning, and evaluating advanced multimodal agents.