duvoai/duvo-eye-1

TEXT GENERATION | Concurrency Cost: 3 | Model Size: 35.1B | Quant: FP8 | Ctx Length: 32k | Tool Calling: Supported | Published: Jun 12, 2026 | License: apache-2.0 | Architecture: Transformer | Open Weights | Cold

duvoai/duvo-eye-1 is a 35.1 billion parameter Vision-Language Model (VLM) developed by Duvo, optimized for single-step GUI element grounding. Built by LoRA fine-tuning Hcompany/Holo-3.1-35B-A3B, it localizes click positions from a screenshot and a natural-language description. It ranks #1 on UI-I2E-Bench (84.2) and matches the top models on ScreenSpot-v2 (95.1), and it is particularly strong on enterprise back-office UIs.

duvo-eye-1: GUI Grounding for Enterprise Computer Use

duvo-eye-1 is a Vision-Language Model (VLM) developed by Duvo, specifically designed for single-step GUI element grounding. Given a screenshot and a natural-language description, it outputs a precise click position {"x","y"}. The model is a LoRA fine-tune of Hcompany/Holo-3.1-35B-A3B (a 35B-A3B MoE with 3B active parameters), and the fine-tune was trained on synthetic enterprise back-office UIs.
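
As an illustration of consuming the click position described above, here is a minimal Python sketch that parses a response and rejects anything that is not a two-key JSON object. The helper name `parse_click` is hypothetical, the strict "exactly x and y" check is an assumption, and the card does not say whether coordinates are pixels or normalized values, so the sketch returns them unchanged.

```python
import json


def parse_click(raw: str) -> tuple[float, float]:
    """Parse a response shaped like {"x": ..., "y": ...} into an (x, y) pair.

    Hypothetical helper: the exact schema and the coordinate units are
    assumptions, not taken from the model card.
    """
    # json.loads raises json.JSONDecodeError (a ValueError subclass) on bad JSON.
    data = json.loads(raw)
    if not isinstance(data, dict) or set(data) != {"x", "y"}:
        raise ValueError(f"expected exactly the keys 'x' and 'y', got: {raw!r}")
    for name in ("x", "y"):
        value = data[name]
        # bool is a subclass of int, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            raise ValueError(f"{name} must be a number, got {value!r}")
    return float(data["x"]), float(data["y"])


if __name__ == "__main__":
    print(parse_click('{"x": 412, "y": 96}'))
```

Raising on any unexpected shape, rather than guessing, keeps a downstream agent from clicking at a position the model never actually produced.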

Key Capabilities & Performance

  • Top-tier Grounding: Ranks #1 on the maintained UI-I2E-Bench leaderboard (84.2) and achieves 95.1 on ScreenSpot-v2, matching top models. It also exceeds the best published UI-Vision element-grounding number (64.4).
  • Efficiency: Achieves these results with only 3B active parameters, offering high performance at a lower serving cost than larger models.
  • Output Reliability: Significantly improves over its base model by eliminating malformed outputs, ensuring consistent and valid JSON responses.
  • In-domain Expertise: Shows substantial gains in its target enterprise-UI domain (SynthUI test: 86.6 vs. 62.5 for the base).
  • Single-Shot Excellence: Its 72.9 on ScreenSpot-Pro is the second-highest single-forward-pass result on the public leaderboard, outperforming many larger single models.
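
The card does not spell out how the scores above are computed. Grounding benchmarks of this kind are commonly scored by counting a prediction as correct when the predicted click falls inside the target element's bounding box; the sketch below assumes that rule. The function names and the (left, top, right, bottom) box layout are illustrative choices, not taken from any benchmark's official scorer.

```python
from typing import Sequence

Point = tuple[float, float]              # (x, y)
Box = tuple[float, float, float, float]  # (left, top, right, bottom), assumed layout


def click_hits(point: Point, box: Box) -> bool:
    """Return True when the click lies inside the box (edges included)."""
    x, y = point
    left, top, right, bottom = box
    return left <= x <= right and top <= y <= bottom


def grounding_accuracy(points: Sequence[Point], boxes: Sequence[Box]) -> float:
    """Percentage of predicted clicks that land inside their target boxes."""
    if len(points) != len(boxes):
        raise ValueError("points and boxes must have the same length")
    if not points:
        return 0.0
    hits = sum(click_hits(p, b) for p, b in zip(points, boxes))
    return 100.0 * hits / len(points)


if __name__ == "__main__":
    predictions = [(5.0, 5.0), (50.0, 50.0)]
    targets = [(0.0, 0.0, 10.0, 10.0), (0.0, 0.0, 10.0, 10.0)]
    print(grounding_accuracy(predictions, targets))
```

Points and boxes must share one coordinate space (both in pixels, or both normalized) for the comparison to be meaningful.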

Good for

  • Automating GUI Interactions: Ideal as the grounding component within a larger agent stack for computer use, turning "what" to interact with into "where" to click.
  • Enterprise Applications: Particularly strong for web, desktop, and professional-software UIs, especially those resembling enterprise back-office systems.
  • Multilingual UI Support: While instructions are in English, it supports English, French, and German interfaces, inheriting multilingual capabilities from its base model.
  • Reproducible Benchmarking: All public-benchmark predictions are published, and three benchmarks are confirmed under maintainers' own scorers, ensuring transparency and verifiability.