vocaela/KV-Ground-4B-BaseQw3vl

VISIONConcurrency Cost:1Model Size:4BQuant:BF16Ctx Length:32kPublished:Feb 25, 2026License:cc-by-nc-sa-4.0Architecture:Transformer Open Weights Cold

The vocaela/KV-Ground-4B-BaseQw3vl is a 4 billion parameter Vision-Language Model (VLM) developed by Kingsware and Vocaela AI, fine-tuned from Qwen3-VL-4B-Instruct. This model is specifically optimized for high-resolution Graphical User Interface (GUI) grounding tasks, excelling at interpreting images and natural language instructions to produce text outputs. It achieves 63.2 on ScreenSpot-Pro for high-resolution tasks and maintains 94.6 on ScreenSpot-V2 for regular-resolution tasks, making it a strong performer in its size class for GUI automation and interaction.

Loading preview...

KV-Ground-4B-BaseQw3vl: High-Resolution GUI Grounding VLM

This 4 billion parameter Vision-Language Model (VLM), developed by Kingsware and Vocaela AI, is a specialized version of Qwen3-VL-4B-Instruct. It is meticulously optimized for high-resolution GUI grounding, a critical capability for automating interactions with complex graphical interfaces.

Key Capabilities & Features

  • High-Resolution GUI Grounding: Achieves 63.2 on ScreenSpot-Pro, demonstrating strong performance on high-resolution images, a common challenge for smaller VLMs.
  • Robust Performance on Regular Tasks: Maintains excellent performance on standard-resolution tasks, scoring 94.6 on ScreenSpot-V2, indicating its versatility.
  • Optimized Training Methodology: Utilizes a unique training recipe including MLLM-as-judge data cleaning, synthesis of high-resolution GUI grounding data, and continuous post-training with SFT followed by GRPO.
  • Input/Output: Processes image inputs combined with natural language instructions to generate text outputs.

When to Use This Model

This model is particularly well-suited for applications requiring precise interaction and understanding of graphical user interfaces, especially in scenarios involving high-resolution displays. Its optimization for GUI grounding makes it ideal for:

  • GUI Automation: Automating tasks on desktop or mobile interfaces.
  • Assistive Technologies: Developing tools that help users navigate complex applications.
  • Testing & QA: Automating UI testing processes.

It offers a competitive balance of size and performance, outperforming its base model and several other specialized GUI models in its parameter class on key benchmarks like ScreenSpot-Pro, while not sacrificing performance on regular-resolution tasks.