duvoai/duvo-eye-1
duvoai/duvo-eye-1 is a 35.1-billion-parameter Vision-Language Model (VLM) developed by Duvo and optimized for single-step GUI element grounding. Built by LoRA fine-tuning Hcompany/Holo-3.1-35B-A3B, it localizes click positions from a screenshot and a natural-language description. On UI grounding benchmarks it ranks #1 on UI-I2E-Bench (84.2) and matches the top models on ScreenSpot-v2 (95.1), and it is particularly strong on enterprise back-office UIs.
duvo-eye-1: GUI Grounding for Enterprise Computer Use
duvo-eye-1 is a Vision-Language Model (VLM) developed by Duvo, specifically designed for single-step GUI element grounding. Given a screenshot and a natural-language description, it outputs a precise click position {"x","y"}. The model is a LoRA fine-tune, trained on synthetic enterprise back-office UIs, of Hcompany/Holo-3.1-35B-A3B, a mixture-of-experts (MoE) model with 35B total and 3B active parameters.
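Because the model replies with a click position rather than free text, a caller will usually want to validate that reply before acting on it. The following is a minimal sketch, not the official client: the names `Click` and `parse_click` are hypothetical, and it assumes the reply is a JSON object with numeric `x` and `y` fields expressed in pixels of the screenshot that was sent. If the model emits normalized coordinates instead, the bounds check would need to change.

```python
import json
from typing import NamedTuple


class Click(NamedTuple):
    """A click position on a screenshot (hypothetical helper type)."""
    x: float
    y: float


def parse_click(raw: str, width: int, height: int) -> Click:
    """Parse a {"x": ..., "y": ...} reply and check it lies inside the screenshot."""
    # json.JSONDecodeError is a subclass of ValueError, so malformed
    # output surfaces as a ValueError like every other failure here.
    data = json.loads(raw)
    if not isinstance(data, dict) or set(data) != {"x", "y"}:
        raise ValueError(f"expected exactly the keys 'x' and 'y', got: {data!r}")

    x, y = data["x"], data["y"]
    for name, value in (("x", x), ("y", y)):
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            raise ValueError(f"{name} must be a number, got: {value!r}")

    # Assumption: coordinates are pixels in the screenshot's own resolution.
    if not (0 <= x < width and 0 <= y < height):
        raise ValueError(f"click ({x}, {y}) is outside a {width}x{height} screenshot")
    return Click(float(x), float(y))
```

For example, `parse_click('{"x": 100, "y": 50}', 1920, 1080)` returns `Click(x=100.0, y=50.0)`, while a reply with a missing key or an out-of-bounds coordinate raises `ValueError`.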
Key Capabilities & Performance
- Top-tier Grounding: Ranks #1 on the maintained UI-I2E-Bench leaderboard (84.2) and achieves 95.1 on ScreenSpot-v2, matching top models. It also exceeds the best published UI-Vision element-grounding number (64.4).
- Efficiency: Achieves these results with only 3B active parameters, offering high performance at a lower serving cost than larger models.
- Output Reliability: Significantly improves over its base model by eliminating malformed outputs, ensuring consistent and valid JSON responses.
- In-domain Expertise: Shows substantial gains in its target enterprise-UI domain (SynthUI test: 86.6 vs. 62.5 for the base).
- Single-Shot Excellence: Its 72.9 on ScreenSpot-Pro is the second-highest single-forward-pass result on the public leaderboard, outperforming many larger single models.
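Grounding scores like those above are commonly reported as the percentage of predicted clicks that land inside the target element's bounding box. The sketch below illustrates that metric under stated assumptions: `is_hit` and `grounding_accuracy` are hypothetical names, boxes are `(left, top, right, bottom)` in the same coordinate space as the clicks, and edges count as hits. Each benchmark maintainer's scorer may differ in such details, so this is not a substitute for the official scorers.

```python
from typing import Sequence, Tuple

Point = Tuple[float, float]
Box = Tuple[float, float, float, float]  # (left, top, right, bottom)


def is_hit(point: Point, box: Box) -> bool:
    """Return True if the click lies inside the box, edges included."""
    x, y = point
    left, top, right, bottom = box
    return left <= x <= right and top <= y <= bottom


def grounding_accuracy(points: Sequence[Point], boxes: Sequence[Box]) -> float:
    """Percentage of predicted clicks that fall inside their target box."""
    if len(points) != len(boxes):
        raise ValueError("points and boxes must have the same length")
    if not points:
        raise ValueError("need at least one prediction")
    hits = sum(is_hit(p, b) for p, b in zip(points, boxes))
    return 100.0 * hits / len(points)
```

With four predictions of which three land inside their boxes, `grounding_accuracy` returns `75.0`.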
Good for
- Automating GUI Interactions: Ideal as the grounding component within a larger agent stack for computer use, turning "what" to interact with into "where" to click.
- Enterprise Applications: Particularly strong for web, desktop, and professional-software UIs, especially those resembling enterprise back-office systems.
- Multilingual UI Support: Instructions are given in English, but the model handles English, French, and German interfaces, a capability inherited from its base model.
- Reproducible Benchmarking: All public-benchmark predictions are published, and results on three benchmarks have been confirmed with the maintainers' own scorers, so the reported numbers can be independently verified.
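The role of a grounding component in an agent stack can be sketched as a single function boundary: a planner decides what to interact with, and the grounder turns that description into coordinates. Everything named here (`Grounder`, `ClickAction`, `plan_click`, `fake_grounder`) is hypothetical and stands in for whatever serving client and action format a real stack uses; the stub grounder returns fixed coordinates so the example runs without a model.

```python
from dataclasses import dataclass
from typing import Callable, Dict

# Assumed interface: (screenshot bytes, description) -> {"x": ..., "y": ...}
Grounder = Callable[[bytes, str], Dict[str, float]]


@dataclass
class ClickAction:
    """An executable click produced from a planner's description."""
    description: str
    x: float
    y: float


def plan_click(screenshot: bytes, description: str, ground: Grounder) -> ClickAction:
    """Resolve a "what" (description) into a "where" (click coordinates)."""
    position = ground(screenshot, description)
    return ClickAction(description, position["x"], position["y"])


def fake_grounder(screenshot: bytes, description: str) -> Dict[str, float]:
    """Stub standing in for a call to the model; returns fixed coordinates."""
    return {"x": 412.0, "y": 96.0}


action = plan_click(b"", "the Save button", fake_grounder)
```

Keeping the grounder behind a plain callable lets the rest of the agent be tested with a stub and lets the model be swapped without touching the planner.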