vocaela/KV-Ground-4B-BaseGuiOwl1.5-0228
KV-Ground-4B-BaseGuiOwl1.5-0228 is a 4 billion parameter Vision-Language Model (VLM) developed by Kingsware and Vocaela AI, specifically designed for GUI grounding. Fine-tuned from GUI-Owl-1.5-4B-Instruct, it takes high-resolution images and natural language instructions as input and produces text output. It scores 67.0 on ScreenSpot-Pro, making it a leading 4B model for high-resolution GUI grounding, while maintaining strong performance on regular-resolution tasks with 94.1 on ScreenSpot-V2.
Overview
KV-Ground-4B-BaseGuiOwl1.5-0228 is a 4 billion parameter Vision-Language Model (VLM) developed by Kingsware and Vocaela AI. It is optimized for GUI grounding tasks, particularly on high-resolution images. The model is a fine-tuned version of GUI-Owl-1.5-4B-Instruct and inherits its architecture and configuration.
Key Capabilities & Differentiators
- High-Resolution GUI Grounding: Achieves 67.0 on ScreenSpot-Pro, positioning it as the best-performing 4B model for high-resolution GUI grounding without requiring Chain-of-Thought (CoT) reasoning.
- Robust Performance on Regular Resolution: Maintains excellent performance on standard resolution tasks, scoring 94.1 on ScreenSpot-V2, indicating its versatility.
- Optimized Training Methodology: Developed using an approach that combines:
  - Data Cleaning: Uses an MLLM as a judge over multiple rounds of data cleaning to address the ~30% error rate in public GUI grounding datasets, which significantly improves performance on high-resolution images.
  - Synthesized High-Resolution Data: Incorporates high-quality, synthesized high-resolution GUI grounding data.
  - Continuous Post-Training: Employs Supervised Fine-Tuning (SFT) followed by reinforcement learning with Group Relative Policy Optimization (GRPO) for continuous improvement.
- Benchmark Performance: Demonstrates consistent improvements across various benchmarks compared to its base model, GUI-Owl-1.5-4B-Instruct, and competes effectively with other specialized GUI models under 8B parameters.
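For context on the scores quoted above: ScreenSpot-style grounding benchmarks are commonly scored by checking whether the predicted click point falls inside the target element's bounding box, and reporting the percentage of hits. The sketch below illustrates that convention only. The `Box` class and `grounding_accuracy` function are hypothetical names introduced here, not part of any official evaluation script, and the exact protocol used for the numbers on this page is not stated.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple

Point = Tuple[float, float]


@dataclass(frozen=True)
class Box:
    """Axis-aligned bounding box in pixel coordinates (hypothetical helper)."""
    left: float
    top: float
    right: float
    bottom: float

    def contains(self, x: float, y: float) -> bool:
        # Points on the border count as inside in this sketch.
        return self.left <= x <= self.right and self.top <= y <= self.bottom


def grounding_accuracy(
    predictions: Sequence[Optional[Point]],
    targets: Sequence[Box],
) -> float:
    """Return the percentage of predicted points that land inside their target box.

    A prediction of None (for example, an unparseable model output) counts as a miss.
    """
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    if not targets:
        return 0.0
    hits = sum(
        1
        for point, box in zip(predictions, targets)
        if point is not None and box.contains(*point)
    )
    return 100.0 * hits / len(targets)


if __name__ == "__main__":
    # Toy data, not taken from any benchmark.
    preds = [(10, 10), (50, 50), None, (5, 5)]
    boxes = [Box(0, 0, 20, 20), Box(0, 0, 20, 20), Box(0, 0, 20, 20), Box(0, 0, 10, 10)]
    print(f"accuracy: {grounding_accuracy(preds, boxes):.1f}")
```

Under this convention, a score such as 67.0 would mean that roughly two out of three predicted points landed inside the correct element.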
Use Cases
This model is ideal for applications requiring precise GUI element identification and interaction based on visual input and natural language commands, especially in scenarios involving detailed or high-resolution user interfaces. Its strengths make it suitable for automated UI testing, accessibility tools, and intelligent agents interacting with graphical user interfaces.
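In an agent or UI-testing pipeline, a predicted point usually has to be mapped back onto the real screen before a click is issued, because screenshots are often resized before being passed to a model. The following is a minimal sketch of that rescaling step. It assumes the model's text output has already been parsed into an `(x, y)` point expressed in the pixel space of the resized image; this page does not specify the model's actual output format, and `to_screen_pixels` is a hypothetical helper name.

```python
from typing import Tuple


def to_screen_pixels(
    point: Tuple[float, float],
    model_size: Tuple[int, int],
    screen_size: Tuple[int, int],
) -> Tuple[int, int]:
    """Map a point from the resized model-input image to original screen pixels.

    point:       (x, y) in the coordinate space of the image the model saw (assumed).
    model_size:  (width, height) of that resized image.
    screen_size: (width, height) of the original screenshot.
    """
    px, py = point
    model_w, model_h = model_size
    screen_w, screen_h = screen_size
    if model_w <= 0 or model_h <= 0:
        raise ValueError("model_size must be positive")

    # Scale each axis independently, then round to the nearest pixel.
    x = round(px * screen_w / model_w)
    y = round(py * screen_h / model_h)

    # Clamp so the result is always a valid pixel index on the screen.
    x = min(max(x, 0), screen_w - 1)
    y = min(max(y, 0), screen_h - 1)
    return x, y


if __name__ == "__main__":
    # A point at the center of a 1920x1080 model input, mapped to a 3840x2160 screen.
    print(to_screen_pixels((960, 540), (1920, 1080), (3840, 2160)))
```

If the model is instead given the screenshot at native resolution, or emits normalized coordinates, this step would change accordingly.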