GUI-G1-3B-v1: Visual Grounding for GUI Agents
GUI-G1-3B-v1 is a 3-billion-parameter vision-language model developed by Yuqi Zhou and collaborators, focused on visual grounding within graphical user interfaces. The model is engineered to understand and locate elements in GUIs, a critical capability for building robust GUI agents.
Key Capabilities
- Superior Visual Grounding: Achieves an average accuracy of 89.8% on the ScreenSpotV2 benchmark and 37.1% on ScreenSpot-Pro, outperforming other models like UI-R1-E-3B and OS-ATLAS-7B in these specialized tasks.
- Efficient Inference: Performs grounding directly, without intermediate "thinking" (chain-of-thought) steps; the reported benchmark results were obtained in this direct-inference mode, which keeps latency low.
- GUI Agent Integration: Optimized for applications requiring precise identification of and interaction with UI components, as detailed in the associated research paper "GUI-G1: Understanding R1-Zero-Like Training for Visual Grounding in GUI Agents."
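In an agent pipeline, the model's grounding output must be converted into an actual click on the screen. The sketch below shows one way to post-process a predicted point; the "(x, y)" answer format and the helper name are illustrative assumptions, not the model's documented interface, so check the model card for the exact prompt and output format.

```python
import re


def parse_click(output_text, image_size, screen_size):
    """Extract a predicted click point like "(540, 320)" from model output
    and rescale it from input-image coordinates to screen pixels.

    Assumption: the model answers with a point in "(x, y)" form relative to
    the (possibly resized) screenshot it was given.
    """
    match = re.search(r"\((\d+(?:\.\d+)?),\s*(\d+(?:\.\d+)?)\)", output_text)
    if match is None:
        return None  # no coordinate found in the model's answer
    x, y = float(match.group(1)), float(match.group(2))
    img_w, img_h = image_size
    scr_w, scr_h = screen_size
    # Scale from screenshot resolution to physical screen resolution.
    return round(x * scr_w / img_w), round(y * scr_h / img_h)


# Example: the model saw a 1000x1000 resized screenshot of a 1920x1080 screen.
print(parse_click("click at (500, 500)", (1000, 1000), (1920, 1080)))
# → (960, 540)
```

The resulting screen-space point can then be handed to whatever input-automation layer the agent uses (e.g. an OS-level click API).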
Good For
- Developing automated GUI interaction systems.
- Building intelligent agents that navigate and operate software interfaces.
- Research and development in visual grounding and human-computer interaction.