HelloKKMe/GTA1-32B

VISION · Concurrency Cost: 2 · Model Size: 32B · Quant: FP8 · Ctx Length: 32k · Published: Jun 4, 2025 · Architecture: Transformer

HelloKKMe/GTA1-32B is a 32-billion-parameter vision-language model developed by HelloKKMe, designed specifically for GUI grounding tasks. Trained with reinforcement learning (RL) using GRPO, the model excels at accurately locating UI elements by directly rewarding actionable responses rather than verbose Chain-of-Thought reasoning. It achieves state-of-the-art performance on challenging GUI grounding benchmarks such as ScreenSpot-V2, ScreenSpot-Pro, and OSWorld-G, making it well suited to applications that require precise UI interaction.


HelloKKMe/GTA1-32B: State-of-the-Art GUI Grounding Model

HelloKKMe/GTA1-32B is a 32-billion-parameter vision-language model developed by HelloKKMe, engineered specifically for Graphical User Interface (GUI) grounding. The model leverages reinforcement learning (RL) with GRPO (Group Relative Policy Optimization) to achieve superior performance in identifying and locating UI elements.
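In practice, grounding means mapping a screenshot plus a natural-language instruction to the pixel coordinates of a target element. The sketch below shows one way to query a self-hosted instance; the endpoint URL, the served model name, and the `(x, y)` reply format are assumptions (e.g., serving the checkpoint through an OpenAI-compatible API such as vLLM's), not part of the model's official documentation.

```python
# Minimal sketch: ask a served GTA1-32B instance where to click.
# Assumes an OpenAI-compatible endpoint at http://localhost:8000/v1;
# the URL, served model name, prompt, and "(x, y)" output format are
# illustrative assumptions, not documented behavior.
import base64
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# Encode the screenshot as a base64 data URL for the vision input.
with open("screenshot.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="HelloKKMe/GTA1-32B",
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            {"type": "text",
             "text": "Locate the 'Submit' button. Answer with coordinates (x, y)."},
        ],
    }],
    temperature=0.0,  # grounding wants deterministic, actionable output
)

print(response.choices[0].message.content)  # e.g. "(412, 873)"
```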

Key Capabilities & Differentiators

  • Direct Objective Alignment: Unlike models relying on extensive Chain-of-Thought (CoT) reasoning, GTA1-32B directly rewards successful clicks and actionable responses, leading to more grounded and precise UI element identification.
  • State-of-the-Art Performance: The model consistently achieves leading results across challenging GUI grounding datasets, including ScreenSpot-V2 (93.2%), ScreenSpot-Pro (53.6%), and OSWorld-G (61.9%).
  • Benchmarking Excellence: The 32B variant shows significant gains over its base model, Qwen2.5-VL-32B-Instruct (e.g., +1.3% on ScreenSpot-V2, +5.6% on ScreenSpot-Pro, +2.3% on OSWorld-G).
  • Optimized for UI Interaction: Its training methodology makes it highly effective for tasks requiring accurate localization of interactive elements within a GUI.

When to Use This Model

  • Automated UI Testing: For precisely locating and interacting with UI elements in automated testing frameworks (see the click-dispatch sketch after this list).
  • Robotic Process Automation (RPA): To enable robots to accurately identify and click on specific GUI components.
  • Accessibility Tools: Developing tools that assist users in navigating and interacting with complex interfaces.
  • Any application requiring highly accurate GUI element localization.
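For the automation use cases above, the grounding answer still has to be turned into an input event. Below is a minimal, illustrative sketch, assuming the model replies in an `(x, y)` format and using `pyautogui` to dispatch the click; the `click_from_answer` helper is hypothetical, not part of the model's tooling.

```python
# Minimal sketch: turn a grounding answer into an actual click.
# The "(x, y)" reply format and this helper are illustrative assumptions;
# adapt the parsing to whatever format your deployment actually emits.
import re
import pyautogui  # pip install pyautogui

def click_from_answer(answer: str) -> None:
    """Parse '(x, y)' from the model's reply and click that screen point."""
    match = re.search(r"\((\d+)\s*,\s*(\d+)\)", answer)
    if match is None:
        raise ValueError(f"No coordinates found in model answer: {answer!r}")
    x, y = int(match.group(1)), int(match.group(2))
    pyautogui.click(x, y)

click_from_answer("(412, 873)")  # clicks at pixel (412, 873)
```

Note that if the serving stack resizes screenshots before inference, the returned coordinates may refer to the resized image and need rescaling to the native screen resolution before the click is dispatched.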

This model is particularly suited for scenarios where direct, grounded interaction with visual interfaces is paramount, offering a robust solution for complex GUI automation and understanding.