vocaela/KV-Ground-8B-BaseGuiOwl1.5-0315

VISIONConcurrency Cost:1Model Size:8BQuant:FP8Ctx Length:32kPublished:Mar 16, 2026License:cc-by-nc-sa-4.0Architecture:Transformer0.0K Open Weights Cold

KV-Ground-8B-BaseGuiOwl1.5-0315 is an 8 billion parameter Vision-Language Model (VLM) developed by Kingsware and Vocaela AI, fine-tuned from GUI-Owl-1.5-8B-Instruct. This model is specifically optimized for high-resolution GUI grounding tasks, excelling at interpreting images and natural language instructions to produce text outputs. It achieves 73.2 on ScreenSpot-Pro without reasoning CoT and 80.5 with a zoom-in strategy, making it a top performer for GUI automation and interaction.

Loading preview...

Model Overview

KV-Ground-8B-BaseGuiOwl1.5-0315 is an 8 billion parameter Vision-Language Model (VLM) developed by Kingsware and Vocaela AI. It is fine-tuned from the GUI-Owl-1.5-8B-Instruct model, with a primary focus on optimizing performance for high-resolution Graphical User Interface (GUI) grounding tasks. The model takes an image and natural language instruction as input and generates text output.

Key Capabilities & Differentiators

  • High-Resolution GUI Grounding: Specifically trained and optimized for high-resolution GUI images, addressing common performance degradation issues in this domain.
  • Superior Benchmarking: Achieves 73.2 on ScreenSpot-Pro without reasoning CoT, ranking as the best pure model capability across all models in this benchmark. When combined with a zoom-in strategy, it reaches 80.5, making it the top-ranked system.
  • Consistent Performance: Maintains excellent performance on regular-resolution tasks, scoring 94.6 on ScreenSpot-V2, and shows notable gains on OSWorld-G and OSWorld-G-refined.
  • Advanced Training Methodology: Utilizes a unique recipe involving MLLM-as-judge for data cleaning, synthesis of high-quality high-resolution GUI grounding data, and continued post-training via SFT followed by GRPO.

Ideal Use Cases

This model is particularly well-suited for applications requiring precise interaction and understanding of high-resolution graphical interfaces, such as:

  • Automated UI testing and interaction
  • Robotic Process Automation (RPA) for GUI-driven tasks
  • Accessibility tools for navigating complex interfaces
  • Developing intelligent agents that interact with digital environments