inclusionAI/VISTA-4B

VISION | Concurrency Cost: 1 | Model Size: 4.5B | Quant: BF16 | Ctx Length: 32k | Tool Calling: Supported | Published: Jun 12, 2026 | License: apache-2.0 | Architecture: Transformer | Open Weights | Cold

inclusionAI/VISTA-4B is a 4.5-billion-parameter GUI-grounding vision-language model built on the Qwen3.5 4B backbone. Given a natural-language instruction and a screenshot of a graphical user interface (GUI), it maps the instruction to a click coordinate on that screenshot. It is trained with View-Consistent Self-Verified Training (VISTA), which targets robust GUI element localization across different views of the same interface and makes the model suited to automated UI interaction tasks.


VISTA-4B: GUI Grounding Vision-Language Model

VISTA-4B is a 4.5-billion-parameter vision-language model developed by inclusionAI, based on the Qwen3.5 4B architecture. Its core function is GUI grounding: given a screenshot and a natural-language instruction, it outputs a click coordinate in a normalized 0-1000 image frame.
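Because the click coordinate lives in a normalized 0-1000 frame, a caller has to rescale it to the screenshot's pixel size before clicking. The following is a minimal sketch of that conversion, not the model's official post-processing. It assumes that 0-1000 maps linearly onto the full width and height of the screenshot, and the helper name `to_pixel` is hypothetical.

```python
def to_pixel(x_norm: float, y_norm: float, width: int, height: int) -> tuple[int, int]:
    """Map a point from the normalized 0-1000 frame to pixel coordinates.

    Assumption (not stated in the model card): each axis scales linearly,
    with 0 at the left/top edge and 1000 at the right/bottom edge.
    """
    if not (0 <= x_norm <= 1000 and 0 <= y_norm <= 1000):
        raise ValueError("normalized coordinates must lie in [0, 1000]")
    # Scale each axis independently, then clamp so that 1000 lands on the
    # last valid pixel index instead of one past the edge.
    px = min(int(x_norm * width / 1000), width - 1)
    py = min(int(y_norm * height / 1000), height - 1)
    return px, py


if __name__ == "__main__":
    # A prediction of (500, 250) on a 1920x1080 screenshot.
    print(to_pixel(500, 250, 1920, 1080))
```

Clamping is a design choice for this sketch: it keeps a prediction at the frame boundary clickable rather than raising an out-of-range error downstream.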

Key Differentiators & Training

The model is distinguished by its training method, VISTA (View-Consistent Self-Verified Training for GUI Grounding), which has two components:

  • View-consistent GRPO training: It constructs comparison groups from target-preserving views of the same GUI instance, with exact coordinate remapping. This enhances localization robustness across semantically equivalent but geometrically different screenshots.
  • Self-verified cross-view anchoring: The training objective incorporates oracle-format center-point anchors only when model-generated rollouts achieve maximum reward, stabilizing coordinate generation without unconditional imitation on failed predictions.
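The two ideas above can be illustrated with a small sketch. It is not the authors' training code. It assumes that a "view" is a rectangular crop expressed in the same 0-1000 frame, and that the anchor term is enabled when at least one rollout in a group reaches the maximum reward; the source does not specify either detail. The names `CropView`, `remap_point`, and `use_anchor` are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class CropView:
    """A rectangular crop of the original screenshot, in 0-1000 coordinates."""
    left: float
    top: float
    right: float
    bottom: float


def remap_point(x: float, y: float, view: CropView) -> Optional[tuple[float, float]]:
    """Re-express a target point in the crop's own 0-1000 frame.

    Returns None when the target lies outside the crop, i.e. when the view
    would not be target-preserving and should be discarded.
    """
    if not (view.left <= x <= view.right and view.top <= y <= view.bottom):
        return None
    width = view.right - view.left
    height = view.bottom - view.top
    return ((x - view.left) * 1000 / width, (y - view.top) * 1000 / height)


def use_anchor(rewards: Sequence[float], max_reward: float = 1.0) -> bool:
    """Gate the anchor term on the model's own rollouts.

    Assumption: the gate opens if any rollout attains the maximum reward,
    so the anchor is never applied to a group of failed predictions.
    """
    return any(r >= max_reward for r in rewards)


if __name__ == "__main__":
    view = CropView(left=200, top=200, right=700, bottom=700)
    print(remap_point(300, 400, view))   # target expressed in the crop's frame
    print(use_anchor([0.0, 1.0, 0.0]))   # one successful rollout opens the gate
```

The remapping is exact because a crop is an affine transform of the original frame, so the same target can be scored consistently across every view in a comparison group.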

Performance

On GUI grounding benchmarks, VISTA-4B often outperforms its Qwen3.5-4B and GRPO-4B counterparts. For instance, it achieves 64.2% on SSPro and 61.2% on OSWorld-G, improvements of 2.0 and 1.3 percentage points, respectively, over Qwen3.5-4B.

Use Cases

VISTA-4B is suited to applications that require automated interaction with graphical user interfaces, such as:

  • Automated UI testing: Precisely locating and interacting with UI elements based on textual instructions.
  • Robotic process automation (RPA): Enabling bots to navigate and operate software applications more effectively.
  • Accessibility tools: Assisting users with disabilities by translating commands into specific UI actions.