ScaleCUA-32B: Cross-Platform Computer Use Agent
ScaleCUA-32B is a 32 billion parameter Vision-Language Model developed by OpenGVLab, designed to function as a versatile computer use agent. It addresses the need for open-source models that can automate interactions across diverse graphical user interfaces.
Key Capabilities
- Cross-Platform Operation: Trained on a novel, large-scale dataset spanning 6 operating systems and 3 task domains, enabling seamless interaction across heterogeneous platforms.
- GUI Understanding & Grounding: Demonstrates strong performance in interpreting visual interfaces and grounding actions.
- Task Automation: Capable of completing complex, multi-step tasks through two primary modes:
  - Direct Action Mode: For immediate, executable actions based on visual input, such as clicking a specific UI element.
  - Reasoned Action Mode: For complex tasks, where the model first reasons through the problem, states its intended operation, and then generates the corresponding action code.
- State-of-the-Art Performance: Outperforms baselines by +26.6 points on WebArena-Lite-v2 and +10.7 points on ScreenSpot-Pro, and scores 94.4% on MMBench-GUI L1-Hard, 60.6% on OSWorld-G, and 47.4% on WebArena-Lite-v2.
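In Reasoned Action Mode, the model ultimately emits action code that a controller executes against the GUI. As a rough illustration of how such output might be consumed downstream, here is a minimal sketch of parsing a single pyautogui-style action call into a structured command; the `click(x=..., y=...)` format and the `parse_action` helper are illustrative assumptions, not ScaleCUA's documented action schema.

```python
import re

# Matches a single call like "click(x=0.32, y=0.51)".
# NOTE: this format is an assumed example, not the model's
# documented action space; commas inside string arguments
# are not handled by this simple sketch.
ACTION_RE = re.compile(r"(?P<name>\w+)\((?P<args>[^)]*)\)")

def parse_action(action_code: str) -> dict:
    """Turn one action-call string into {'name': ..., 'args': {...}}."""
    m = ACTION_RE.fullmatch(action_code.strip())
    if m is None:
        raise ValueError(f"unrecognized action: {action_code!r}")
    args = {}
    for part in filter(None, (p.strip() for p in m.group("args").split(","))):
        key, _, value = part.partition("=")
        try:
            args[key.strip()] = float(value)  # numeric coordinates/offsets
        except ValueError:
            args[key.strip()] = value.strip().strip("'\"")  # string payloads
    return {"name": m.group("name"), "args": args}

print(parse_action("click(x=0.32, y=0.51)"))
# {'name': 'click', 'args': {'x': 0.32, 'y': 0.51}}
```

A real deployment would instead validate the generated code against the agent's actual action space before dispatching it to an executor such as a mouse/keyboard controller.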
Good For
- Developing autonomous agents for desktop, mobile, and web environments.
- Automating repetitive or complex GUI-based tasks.
- Research in computer vision, natural language processing, and agentic AI, particularly for cross-platform interaction.