inclusionAI/VISTA-4B
inclusionAI/VISTA-4B is a 4.5-billion-parameter GUI-grounding vision-language model built on the Qwen3.5 4B backbone. It interprets a natural-language instruction and maps it to a precise click coordinate on a graphical user interface (GUI) screenshot. Trained with View-Consistent Self-Verified Training (VISTA), it is designed to localize GUI elements robustly across geometrically different views of the same screen, which makes it well suited to automated UI interaction tasks.
VISTA-4B: GUI Grounding Vision-Language Model
VISTA-4B is a 4.5-billion-parameter vision-language model developed by inclusionAI and based on the Qwen3.5 4B architecture. Its core function is GUI grounding: given a screenshot and a natural-language instruction, it outputs a click coordinate in a normalized 0-1000 image frame, independent of the screenshot's pixel resolution.
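Because the coordinate is expressed in a normalized frame, a caller has to rescale it before issuing a real click. The following is a minimal sketch of that conversion, not the model's official post-processing. The function name `to_pixels`, the (x, y) ordering, and the treatment of 1000 as an inclusive upper bound are assumptions made for illustration.

```python
def to_pixels(x_norm, y_norm, width, height, scale=1000):
    """Map a point from the normalized 0-1000 frame to pixel coordinates.

    Assumptions (not confirmed by the model card): x is horizontal, y is
    vertical, and both ends of the 0..scale range are valid.
    """
    if not (0 <= x_norm <= scale and 0 <= y_norm <= scale):
        raise ValueError("coordinate lies outside the normalized frame")
    # Rescale, then clamp so that x_norm == scale still lands on the last pixel.
    px = min(int(x_norm / scale * width), width - 1)
    py = min(int(y_norm / scale * height), height - 1)
    return px, py


if __name__ == "__main__":
    # A prediction at the horizontal middle, one quarter down a 1920x1080 screen.
    print(to_pixels(500, 250, 1920, 1080))
```

Clamping to `width - 1` and `height - 1` keeps a prediction on the bottom or right edge inside the image, which matters when the result is passed to an automation library that rejects out-of-bounds clicks.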
Key Differentiators & Training
The model is trained with VISTA (View-Consistent Self-Verified Training for GUI Grounding), which has two components:
- View-consistent GRPO training: It constructs comparison groups from target-preserving views of the same GUI instance, with exact coordinate remapping. This enhances localization robustness across semantically equivalent but geometrically different screenshots.
- Self-verified cross-view anchoring: The training objective incorporates oracle-format center-point anchors only when model-generated rollouts achieve maximum reward, stabilizing coordinate generation without unconditional imitation on failed predictions.
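The two ideas above can be illustrated with a small sketch. It is not the authors' training code: it assumes that a "view" is a rectangular crop of the original screenshot, and that the anchor term is enabled when at least one rollout in a group reaches the maximum reward. `CropView`, `remap_point`, and `use_anchor` are hypothetical names introduced here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CropView:
    """A rectangular crop of the source screenshot, in source pixels (assumed view type)."""
    left: int
    top: int
    width: int
    height: int


def remap_point(x, y, src_w, src_h, view, scale=1000):
    """Remap a normalized target from the full screenshot into the crop's own 0-1000 frame.

    Returns None when the target falls outside the crop, i.e. the view is not
    target-preserving and should not join the comparison group.
    """
    px = x / scale * src_w
    py = y / scale * src_h
    inside_x = view.left <= px <= view.left + view.width
    inside_y = view.top <= py <= view.top + view.height
    if not (inside_x and inside_y):
        return None
    return ((px - view.left) / view.width * scale,
            (py - view.top) / view.height * scale)


def use_anchor(rewards, max_reward=1.0):
    """Gate the anchor term: apply it only if some rollout reached the maximum reward.

    Whether the real criterion is "any rollout" or "all rollouts" is an assumption here.
    """
    return any(r >= max_reward for r in rewards)
```

For example, a target at the center of a 2000x1000 screenshot remains at (500, 500) in a 1000x1000 crop that starts at pixel (500, 0), while a target near the top-left corner falls outside that crop and the view is discarded.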
Performance
VISTA-4B performs strongly on GUI grounding benchmarks, often outperforming its Qwen3.5-4B and GRPO-4B counterparts. For instance, it achieves 64.2% on SSPro and 61.2% on OSWorld-G, improvements of 2.0 and 1.3 percentage points, respectively, over Qwen3.5-4B.
Use Cases
VISTA-4B is suited to applications that require automated interaction with graphical user interfaces, such as:
- Automated UI testing: Precisely locating and interacting with UI elements based on textual instructions.
- Robotic process automation (RPA): Enabling bots to navigate and operate software applications more effectively.
- Accessibility tools: Assisting users with disabilities by translating commands into specific UI actions.