vocaela/KV-Ground-4B-BaseQw3vl
vocaela/KV-Ground-4B-BaseQw3vl is a 4-billion-parameter Vision-Language Model (VLM) developed by Kingsware and Vocaela AI, fine-tuned from Qwen3-VL-4B-Instruct. It is optimized for high-resolution Graphical User Interface (GUI) grounding: it takes an image and a natural language instruction as input and produces text output. It scores 63.2 on ScreenSpot-Pro (high-resolution tasks) and 94.6 on ScreenSpot-V2 (regular-resolution tasks), making it a strong performer in its size class for GUI automation and interaction.
KV-Ground-4B-BaseQw3vl: High-Resolution GUI Grounding VLM
This 4-billion-parameter Vision-Language Model (VLM), developed by Kingsware and Vocaela AI, is a specialized version of Qwen3-VL-4B-Instruct. It is optimized for high-resolution GUI grounding, a key capability for automating interactions with complex graphical interfaces.
Key Capabilities & Features
- High-Resolution GUI Grounding: Achieves 63.2 on ScreenSpot-Pro, demonstrating strong performance on high-resolution images, a common challenge for smaller VLMs.
- Robust Performance on Regular Tasks: Scores 94.6 on ScreenSpot-V2, so the high-resolution gains do not come at the expense of standard-resolution tasks.
- Optimized Training Methodology: Uses a training recipe that combines MLLM-as-judge data cleaning, synthesis of high-resolution GUI grounding data, and continuous post-training with supervised fine-tuning (SFT) followed by Group Relative Policy Optimization (GRPO).
- Input/Output: Takes an image and a natural language instruction as input and generates text output.
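To make the GRPO step of the training recipe more concrete, here is a minimal sketch of the group-relative advantage computation that GRPO is generally built on: several outputs are sampled for the same prompt, each is scored, and the scores are normalized within that group. The reward function below (1.0 when a predicted click point lands inside the target box) is a hypothetical choice for illustration; the model card does not describe the reward actually used, and all names here are assumptions.

```python
from statistics import mean, pstdev


def point_in_box_reward(point, box):
    """Hypothetical binary reward: 1.0 if the click point is inside the target box."""
    x, y = point
    left, top, right, bottom = box
    return 1.0 if left <= x <= right and top <= y <= bottom else 0.0


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of samples drawn for the same prompt.

    Uses the population standard deviation; eps avoids division by zero
    when every sample in the group receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled click predictions for one (screenshot, instruction) pair.
target_box = (100, 200, 180, 240)  # (left, top, right, bottom) in pixels
samples = [(120, 220), (50, 50), (175, 239), (400, 10)]

rewards = [point_in_box_reward(p, target_box) for p in samples]
advantages = group_relative_advantages(rewards)
```

Samples that hit the target receive a positive advantage and misses a negative one, so the policy update favors outputs that did better than their siblings without needing a separate value model.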
When to Use This Model
This model is well suited to applications that require precise understanding of, and interaction with, graphical user interfaces, especially on high-resolution displays. Its optimization for GUI grounding makes it a good fit for:
- GUI Automation: Automating tasks on desktop or mobile interfaces.
- Assistive Technologies: Developing tools that help users navigate complex applications.
- Testing & QA: Automating UI testing processes.
It offers a competitive balance of size and performance: it outperforms its base model and several other specialized GUI models in its parameter class on key benchmarks such as ScreenSpot-Pro, without sacrificing performance on regular-resolution tasks.
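For readers who want to check grounding quality on their own screenshots, the sketch below shows one way to score text outputs in the style of ScreenSpot benchmarks, where a prediction typically counts as correct when the predicted point falls inside the target element's bounding box. The output format is an assumption: the code takes the first two standalone numbers in the text as (x, y) on a relative 0 to 1000 grid, which may not match what this model emits. Adjust `parse_point` and `scale` to the real format; every function name here is hypothetical.

```python
import re

# Matches standalone numbers only, so the "2" in a key like "point_2d" is skipped.
_NUMBER = re.compile(r"(?<![\w.])-?\d+(?:\.\d+)?")


def parse_point(text):
    """Return the first two standalone numbers in the text as (x, y), or None."""
    numbers = _NUMBER.findall(text)
    if len(numbers) < 2:
        return None
    return float(numbers[0]), float(numbers[1])


def to_pixels(point, width, height, scale=1000.0):
    """Map a point on an assumed 0..scale relative grid to pixel coordinates."""
    x, y = point
    return x / scale * width, y / scale * height


def is_hit(point_px, box):
    """True if the pixel point lies inside box = (left, top, right, bottom)."""
    x, y = point_px
    left, top, right, bottom = box
    return left <= x <= right and top <= y <= bottom


def grounding_accuracy(outputs, boxes, sizes):
    """Fraction of outputs whose parsed point falls inside the target box."""
    if not outputs:
        return 0.0
    hits = 0
    for text, box, (width, height) in zip(outputs, boxes, sizes):
        point = parse_point(text)
        if point is not None and is_hit(to_pixels(point, width, height), box):
            hits += 1
    return hits / len(outputs)


# Three made-up outputs for 3840x2160 screenshots.
outputs = ['{"point_2d": [500, 250]}', "click at 10, 10", "no coordinates"]
boxes = [(1900, 500, 2000, 600), (100, 100, 200, 200), (0, 0, 50, 50)]
sizes = [(3840, 2160)] * 3

accuracy = grounding_accuracy(outputs, boxes, sizes)
```

Unparseable outputs are counted as misses rather than dropped, which keeps the score comparable across runs with different failure rates.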