OpenGVLab/ScaleCUA-32B

Vision | Concurrency Cost: 2 | Model Size: 32B | Quant: FP8 | Ctx Length: 32k | Published: Sep 16, 2025 | License: apache-2.0 | Architecture: Transformer | Open Weights | Cold

ScaleCUA-32B by OpenGVLab is a 32 billion parameter Vision-Language Model designed as a cross-platform computer use agent, capable of automating interactions with graphical user interfaces across 6 operating systems and 3 task domains. It excels at GUI understanding, grounding, and task completion, achieving state-of-the-art performance on benchmarks like MMBench-GUI L1-Hard (94.4%), OSWorld-G (60.6%), and WebArena-Lite-v2 (47.4%). The model is optimized for seamless operation across heterogeneous platforms, making it suitable for general computer use automation.


ScaleCUA-32B: Cross-Platform Computer Use Agent

ScaleCUA-32B is a 32 billion parameter Vision-Language Model developed by OpenGVLab and designed to function as a computer use agent. It addresses the need for open-source models that can automate interactions across diverse graphical user interfaces.

Key Capabilities

  • Cross-Platform Operation: Trained on a novel, large-scale dataset spanning 6 operating systems and 3 task domains, enabling seamless interaction across heterogeneous platforms.
  • GUI Understanding & Grounding: Demonstrates strong performance in interpreting visual interfaces and grounding actions to specific UI elements.
  • Task Automation: Capable of completing complex, multi-step tasks through two primary modes:
    • Direct Action Mode: For immediate, executable actions based on visual input, such as clicking specific UI elements.
    • Reasoned Action Mode: For complex tasks, where the model first reasons through the problem, states its intended operation, and then generates corresponding action code.
  • State-of-the-Art Performance: Improves over baselines by 26.6 points on WebArena-Lite-v2 and 10.7 points on ScreenSpot-Pro. It scores 94.4% on MMBench-GUI L1-Hard, 60.6% on OSWorld-G, and 47.4% on WebArena-Lite-v2.
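
The two task automation modes described above suggest that a client needs to handle two response shapes: a bare action, or reasoning followed by a stated operation and action code. The following is a minimal sketch of such a parser, not the model's documented interface. The tag names (`think`, `operation`, `action`), the `AgentStep` type, the `parse_step` helper, and the `click(x=..., y=...)` action syntax are all assumptions made for illustration; this card does not specify the output format, so check the official ScaleCUA documentation for the actual one.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class AgentStep:
    mode: str                  # "direct" or "reasoned"
    reasoning: Optional[str]   # the model's reasoning, if present
    operation: Optional[str]   # the stated intended operation, if present
    action: str                # the action code to execute


def _extract(tag: str, text: str) -> Optional[str]:
    """Return the stripped contents of <tag>...</tag>, or None if absent."""
    match = re.search(rf"<{tag}>(.*?)</{tag}>", text, flags=re.DOTALL)
    return match.group(1).strip() if match else None


def parse_step(response: str) -> AgentStep:
    """Split a model response into reasoning, operation, and action.

    Assumes (hypothetically) that reasoned responses wrap each part in
    <think>, <operation>, and <action> tags, and that direct responses
    contain only the action code.
    """
    reasoning = _extract("think", response)
    operation = _extract("operation", response)
    action = _extract("action", response)
    if action is None:
        # No action tag found: treat the whole response as action code.
        action = response.strip()
    mode = "reasoned" if reasoning is not None else "direct"
    return AgentStep(mode, reasoning, operation, action)
```

Keeping the parsed step in a small typed record lets the same execution loop consume both modes, with the reasoning and operation fields available for logging when the model provides them.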

Good For

  • Developing autonomous agents for desktop, mobile, and web environments.
  • Automating repetitive or complex GUI-based tasks.
  • Research in computer vision, natural language processing, and agentic AI, particularly for cross-platform interaction.