vocaela/KV-Ground-8B-BaseGuiOwl1.5-0315
KV-Ground-8B-BaseGuiOwl1.5-0315 is an 8 billion parameter Vision-Language Model (VLM) developed by Kingsware and Vocaela AI, fine-tuned from GUI-Owl-1.5-8B-Instruct. It is optimized for high-resolution GUI grounding: given an image and a natural language instruction, it produces text output. It scores 73.2 on ScreenSpot-Pro without chain-of-thought (CoT) reasoning and 80.5 with a zoom-in strategy, which makes it a strong candidate for GUI automation and interaction.
Model Overview
KV-Ground-8B-BaseGuiOwl1.5-0315 is an 8 billion parameter Vision-Language Model (VLM) developed by Kingsware and Vocaela AI. It is fine-tuned from the GUI-Owl-1.5-8B-Instruct model, with a primary focus on optimizing performance for high-resolution Graphical User Interface (GUI) grounding tasks. The model takes an image and natural language instruction as input and generates text output.
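To make the grounding task concrete, here is a minimal sketch of how a grounding prediction is commonly scored: a predicted click point counts as correct when it falls inside the target element's bounding box. This is an assumption about the usual ScreenSpot-style protocol, not something this card specifies, and it assumes the model's text output has already been parsed into pixel coordinates (the output format is not described here). The names `point_in_box` and `grounding_accuracy` are hypothetical.

```python
from typing import Sequence, Tuple

Point = Tuple[float, float]              # (x, y) in pixels
Box = Tuple[float, float, float, float]  # (left, top, right, bottom) in pixels


def point_in_box(point: Point, box: Box) -> bool:
    """Return True when the point lies inside the box (edges included)."""
    x, y = point
    left, top, right, bottom = box
    return left <= x <= right and top <= y <= bottom


def grounding_accuracy(predictions: Sequence[Point], targets: Sequence[Box]) -> float:
    """Percentage of predicted points that land inside their target box.

    Assumed scoring rule (hypothetical helper, not taken from the model card).
    """
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    if not predictions:
        return 0.0
    hits = sum(point_in_box(p, b) for p, b in zip(predictions, targets))
    return 100.0 * hits / len(predictions)
```

Under this rule, a score such as 73.2 would mean that roughly 73 of every 100 instructions were resolved to a point inside the correct element.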
Key Capabilities & Differentiators
- High-Resolution GUI Grounding: Trained and optimized specifically for high-resolution GUI images, where grounding accuracy commonly degrades.
- Benchmark Results: Achieves 73.2 on ScreenSpot-Pro without chain-of-thought (CoT) reasoning, reported as the best single-model result on this benchmark. Combined with a zoom-in strategy, it reaches 80.5, reported as the top-ranked system.
- Consistent Performance: Maintains excellent performance on regular-resolution tasks, scoring 94.6 on ScreenSpot-V2, and shows notable gains on OSWorld-G and OSWorld-G-refined.
- Training Methodology: Uses a recipe that combines MLLM-as-judge data cleaning, synthesis of high-quality high-resolution GUI grounding data, and continued post-training with supervised fine-tuning (SFT) followed by Group Relative Policy Optimization (GRPO).
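The card does not describe how its zoom-in strategy works. One common reading, sketched below purely as an assumption, is a two-pass scheme: predict a coarse point on the full screenshot, crop a window around it, predict again on the crop, then map the refined point back to full-image coordinates. The `predict` callable, the crop size, and all function names are hypothetical; `predict` stands in for whatever code runs the model and parses its text output into a pixel point.

```python
from typing import Callable, Optional, Tuple

Point = Tuple[float, float]      # (x, y) in pixels
Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels


def crop_box_around(point: Point, image_size: Tuple[int, int],
                    crop_size: Tuple[int, int]) -> Box:
    """Return a crop window centred on the point and clamped to the image."""
    width, height = image_size
    crop_w = min(crop_size[0], width)
    crop_h = min(crop_size[1], height)
    left = int(round(point[0] - crop_w / 2))
    top = int(round(point[1] - crop_h / 2))
    # Shift the window back inside the image when it overhangs an edge.
    left = max(0, min(left, width - crop_w))
    top = max(0, min(top, height - crop_h))
    return (left, top, left + crop_w, top + crop_h)


def to_full_image(point_in_crop: Point, box: Box) -> Point:
    """Translate a point from crop coordinates to full-image coordinates."""
    return (box[0] + point_in_crop[0], box[1] + point_in_crop[1])


def zoom_in_ground(predict: Callable[[Optional[Box]], Point],
                   image_size: Tuple[int, int],
                   crop_size: Tuple[int, int]) -> Point:
    """Two-pass grounding sketch (assumed scheme, not the authors' method).

    predict(None) runs on the full screenshot; predict(box) runs on the crop
    and returns a point relative to the crop's top-left corner.
    """
    coarse = predict(None)
    box = crop_box_around(coarse, image_size, crop_size)
    fine = predict(box)
    return to_full_image(fine, box)
```

The second pass lets the model see the target region at a larger effective scale, which is a plausible reason for the gap between the 73.2 and 80.5 scores, though the card itself does not state the mechanism.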
Ideal Use Cases
This model is particularly well-suited for applications requiring precise interaction and understanding of high-resolution graphical interfaces, such as:
- Automated UI testing and interaction
- Robotic Process Automation (RPA) for GUI-driven tasks
- Accessibility tools for navigating complex interfaces
- Developing intelligent agents that interact with digital environments