Name: inclusionAI/VISTA-4B API
Brand: Featherless.ai
Price: 10.00 USD
Availability: InStock
Author: inclusionAI

VISTA-4B: GUI Grounding Vision-Language Model

VISTA-4B is a 4.5-billion-parameter vision-language model developed by inclusionAI and based on the Qwen3.5 4B architecture. Its core function is GUI grounding: given a screenshot and a natural-language instruction, it outputs a click coordinate in an image frame normalized to the range 0-1000.
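Because the output lives in a normalized 0-1000 frame, a caller has to scale it back to the screenshot's pixel size before clicking. The following is a minimal sketch of that conversion; the function name `to_pixels`, the truncating rounding, and the clamping of the value 1000 to the last pixel are illustrative assumptions, not behavior documented for the model.

```python
def to_pixels(x_norm, y_norm, width, height, scale=1000):
    """Map a point from the normalized 0..scale frame to pixel coordinates.

    Hypothetical helper: the rounding and edge conventions are assumptions.
    """
    if not (0 <= x_norm <= scale and 0 <= y_norm <= scale):
        raise ValueError("normalized coordinates must lie in [0, scale]")
    # Truncate to an integer pixel index, then clamp so that a value of
    # exactly `scale` still lands on the last valid pixel.
    px = min(int(x_norm / scale * width), width - 1)
    py = min(int(y_norm / scale * height), height - 1)
    return px, py


if __name__ == "__main__":
    # A prediction at the horizontal centre, a quarter of the way down,
    # on a 1920x1080 screenshot.
    print(to_pixels(500, 250, 1920, 1080))
```

Working in a normalized frame keeps the prediction independent of the screenshot resolution, so the same output can be applied to a resized copy of the image.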

Key Differentiators & Training

This model is distinguished by its training methodology, VISTA (View-Consistent Self-Verified Training for GUI Grounding), which has two components:

View-consistent GRPO training: It constructs comparison groups from target-preserving views of the same GUI instance, with exact coordinate remapping. This enhances localization robustness across semantically equivalent but geometrically different screenshots.
Self-verified cross-view anchoring: The training objective incorporates oracle-format center-point anchors only when model-generated rollouts achieve maximum reward, stabilizing coordinate generation without unconditional imitation on failed predictions.
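The two ideas above can be illustrated with a small sketch. It assumes a cropped view as the geometric transform and assumes that the anchor is enabled when at least one rollout reaches the maximum reward; the description does not specify either detail, and the names `remap_to_crop` and `use_anchor` are hypothetical, not taken from the VISTA training code.

```python
def remap_to_crop(x_norm, y_norm, width, height, crop, scale=1000):
    """Remap a normalized target from the full screenshot to a cropped view.

    `crop` is (left, top, right, bottom) in pixels of the original image.
    Returns None when the target falls outside the crop, meaning the view
    would not be target-preserving.
    """
    left, top, right, bottom = crop
    # Normalized frame -> pixel position in the original image.
    px = x_norm / scale * width
    py = y_norm / scale * height
    if not (left <= px < right and top <= py < bottom):
        return None
    # Pixel position relative to the crop -> normalized frame of the crop.
    return ((px - left) / (right - left) * scale,
            (py - top) / (bottom - top) * scale)


def use_anchor(rewards, max_reward=1.0):
    """Gate the anchor term on rollout success.

    Assumption: the anchor applies if any rollout reached the maximum reward.
    """
    return any(r >= max_reward for r in rewards)


if __name__ == "__main__":
    # Target at the centre of a 2000x1000 screenshot, viewed through a
    # centred 1000x500 crop: it stays at the centre of the new frame.
    print(remap_to_crop(500, 500, 2000, 1000, (500, 250, 1500, 750)))
    print(use_anchor([0.0, 1.0]), use_anchor([0.0, 0.5]))
```

Returning `None` for a crop that excludes the target makes it easy to discard views that are not target-preserving before they enter a comparison group.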

Performance

VISTA-4B performs strongly on GUI grounding benchmarks, often outperforming its Qwen3.5-4B and GRPO-4B counterparts. For instance, it achieves 64.2% on SSPro and 61.2% on OSWorld-G, improvements of 2.0 and 1.3 percentage points respectively over Qwen3.5-4B.

Use Cases

VISTA-4B is suited to applications that require automated interaction with graphical user interfaces, such as:

Automated UI testing: Precisely locating and interacting with UI elements based on textual instructions.
Robotic process automation (RPA): Enabling bots to navigate and operate software applications more effectively.
Accessibility tools: Assisting users with disabilities by translating commands into specific UI actions.
