inclusionAI/VISTA-9B

VISION | Concurrency Cost: 1 | Model Size: 9B | Quant: FP8 | Ctx Length: 32k | Tool Calling: Supported | Published: Jun 12, 2026 | License: apache-2.0 | Architecture: Transformer | Open Weights

VISTA-9B is a 9 billion parameter GUI-grounding vision-language model developed by inclusionAI, built upon the Qwen3.5 backbone. It is designed to map a screenshot and a natural-language instruction to a precise click coordinate within a graphical user interface. The model is trained with View-Consistent Self-Verified Training (VISTA) to improve localization and GUI grounding accuracy, making it suitable for automated UI interaction and testing.

VISTA-9B: GUI-Grounding Vision-Language Model

VISTA-9B is a 9 billion parameter vision-language model developed by inclusionAI, based on the Qwen3.5 architecture and engineered for GUI grounding. Given a screenshot and a natural-language instruction, it predicts a click coordinate on the graphical user interface, expressed in a normalized 0-1000 image frame.
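Because the predicted coordinate lives in a normalized 0-1000 frame, it has to be scaled to the screenshot's pixel size before it can drive a real click. The following is a minimal sketch of that conversion. The function name `normalized_to_pixels`, the inclusive 0-1000 range, and the clamping at the image edge are assumptions made for illustration; the model card does not specify them.

```python
def normalized_to_pixels(x_norm, y_norm, width, height, scale=1000):
    """Map a point from a normalized [0, scale] frame to pixel coordinates.

    Assumption: the model's frame spans 0..scale on both axes, independent
    of the screenshot's aspect ratio.
    """
    if not (0 <= x_norm <= scale and 0 <= y_norm <= scale):
        raise ValueError("normalized coordinates must lie in [0, scale]")
    # Scale each axis independently, then clamp so that x_norm == scale
    # still maps to a valid pixel index (width - 1 / height - 1).
    px = min(int(x_norm * width / scale), width - 1)
    py = min(int(y_norm * height / scale), height - 1)
    return px, py


if __name__ == "__main__":
    # Center of a 1920x1080 screenshot.
    print(normalized_to_pixels(500, 500, 1920, 1080))
```

The resulting pixel pair can then be passed to whatever input-automation layer performs the click.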

Key Capabilities & Training Innovations

  • GUI Grounding: Maps visual input and text instructions to exact click locations.
  • View-Consistent GRPO Training: Employs a novel training method that builds comparison groups from target-preserving views of the same GUI instance, with exact coordinate remapping across cropped views. This improves localization robustness under varying visual presentations.
  • Self-Verified Cross-View Anchoring: Incorporates a training objective that adds oracle-format center-point anchors only when model-generated rollouts achieve maximum reward, stabilizing short coordinate generation.
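The coordinate remapping across cropped views mentioned above can be illustrated with plain geometry. This is a sketch under stated assumptions, not the authors' training code: `remap_point_to_crop` is a hypothetical helper, crops are assumed to be axis-aligned pixel boxes `(left, top, right, bottom)`, and each view is assumed to use its own normalized 0-1000 frame. A crop counts as target-preserving here only if the target point falls inside it.

```python
def remap_point_to_crop(x_norm, y_norm, image_size, crop_box, scale=1000):
    """Re-express a target point in the normalized frame of a cropped view.

    Returns None when the crop does not contain the target, i.e. the view
    is not target-preserving and should not join the comparison group.
    """
    width, height = image_size
    left, top, right, bottom = crop_box
    if right <= left or bottom <= top:
        raise ValueError("crop_box must have positive width and height")

    # Target position in full-image pixels (kept as floats for exactness).
    px = x_norm / scale * width
    py = y_norm / scale * height

    if not (left <= px < right and top <= py < bottom):
        return None

    # Shift into the crop's origin, then renormalize by the crop's size.
    cx = (px - left) / (right - left) * scale
    cy = (py - top) / (bottom - top) * scale
    return cx, cy


if __name__ == "__main__":
    # A centered crop keeps a centered target at the center of the new view.
    print(remap_point_to_crop(500, 500, (2000, 1000), (500, 250, 1500, 750)))
```

The same target therefore receives a different but exactly determined label in every view, which is what lets rollouts from different crops be compared against one another.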

Performance Highlights

VISTA-9B consistently outperforms its Qwen3.5 and GRPO counterparts on GUI grounding benchmarks. It achieves 69.2% on SSPro, 95.8% on SSV2, 68.1% on OSWorld-G, and 75.5% on OSWorld-G-R, improving on previous 9B models.
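The page does not state how these percentages are computed. GUI-grounding benchmarks commonly count a prediction as correct when the predicted click lands inside the target element's bounding box, and the sketch below assumes that convention. The helper names and the sample points are illustrative only and are not benchmark data.

```python
def point_in_box(x, y, box):
    """Return True if (x, y) lies inside box = (left, top, right, bottom)."""
    left, top, right, bottom = box
    return left <= x <= right and top <= y <= bottom


def grounding_accuracy(predictions, target_boxes):
    """Fraction of predicted click points that fall inside their target box.

    Assumption: predictions and boxes share the same coordinate frame
    (for example, both in pixels or both normalized to 0-1000).
    """
    if len(predictions) != len(target_boxes):
        raise ValueError("need exactly one target box per prediction")
    if not predictions:
        return 0.0
    hits = sum(
        point_in_box(x, y, box)
        for (x, y), box in zip(predictions, target_boxes)
    )
    return hits / len(predictions)


if __name__ == "__main__":
    # Made-up points and boxes, purely to exercise the function.
    preds = [(10, 10), (50, 50), (90, 90), (5, 5)]
    boxes = [(0, 0, 20, 20), (0, 0, 20, 20), (80, 80, 100, 100), (6, 6, 10, 10)]
    print(f"{grounding_accuracy(preds, boxes):.1%}")
```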

Recommended Use Cases

  • Automated UI Interaction: Ideal for tasks requiring precise interaction with graphical user interfaces based on natural language commands.
  • UI Testing and Automation: Can be used to automate testing workflows by programmatically clicking specific elements on a screen.
  • Robotic Process Automation (RPA): Applicable in scenarios where a robot needs to understand and interact with software interfaces.