mPLUG/ToolCUA-8B

VISION | Concurrency Cost: 1 | Model Size: 8B | Quant: FP8 | Ctx Length: 32k | Published: May 12, 2026 | License: mit | Architecture: Transformer | Open Weights | Cold

ToolCUA-8B is an 8-billion-parameter model developed by mPLUG, designed as an end-to-end computer-use agent that orchestrates GUI actions and structured tool calls. It learns when to use GUI interactions and when to invoke tools to complete desktop tasks. The model is optimized for shorter, more reliable task trajectories and reports higher accuracy and fewer completion steps on OSWorld-MCP tasks.


ToolCUA-8B: Computer-Use Agent for GUI and Tool Orchestration

ToolCUA-8B is an 8-billion-parameter model developed by mPLUG, engineered as an end-to-end agent that automates desktop tasks by orchestrating both Graphical User Interface (GUI) actions and structured tool calls. Its core contribution is learning when to interact with the GUI, when to invoke external tools, and when to switch between the two, which leads to more efficient and reliable task completion.

Key Capabilities

  • End-to-End Computer Use: Orchestrates complex desktop workflows by combining GUI interactions and tool usage.
  • Intelligent Path Selection: Learns optimal switching decisions between GUI and tool invocation for shorter and more reliable task trajectories.
  • Staged Training Pipeline: Utilizes trajectory-aware tool synthesis, Tool-Bootstrapped GUI RFT for local switching, and Online Agentic RL with a Tool-Efficient Path Reward.
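
The GUI-versus-tool switching described above can be pictured as an agent loop whose policy emits one of two action types at each step. The following is a minimal sketch under stated assumptions, not ToolCUA's actual interface: the names `GuiAction`, `ToolCall`, `Done`, `run_episode`, and `scripted_policy` are hypothetical, and the scripted policy merely stands in for the model.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Union


@dataclass
class GuiAction:
    """Hypothetical GUI step, e.g. kind="click" with screen coordinates."""
    kind: str
    args: dict[str, Any] = field(default_factory=dict)


@dataclass
class ToolCall:
    """Hypothetical structured tool invocation (name plus keyword arguments)."""
    name: str
    arguments: dict[str, Any] = field(default_factory=dict)


@dataclass
class Done:
    """Signals that the policy considers the task finished."""
    answer: str = ""


Action = Union[GuiAction, ToolCall, Done]


def run_episode(
    policy: Callable[[list], Action],
    gui_executor: Callable[[GuiAction], Any],
    tools: dict[str, Callable[..., Any]],
    max_steps: int = 30,
) -> list:
    """Run one episode, routing each action to the GUI or to a tool."""
    history = []
    for _ in range(max_steps):
        action = policy(history)
        if isinstance(action, Done):
            break
        if isinstance(action, ToolCall):
            # Tool path: one structured call can replace several GUI steps.
            observation = tools[action.name](**action.arguments)
        else:
            # GUI path: perform the click/typing and observe the result.
            observation = gui_executor(action)
        history.append((action, observation))
    return history


def scripted_policy(history: list) -> Action:
    """Stand-in for the model: replays a fixed plan (assumption for the demo)."""
    plan = [
        ToolCall("read_file", {"path": "notes.txt"}),
        GuiAction("click", {"x": 10, "y": 20}),
        Done(),
    ]
    return plan[len(history)]


demo_history = run_episode(
    scripted_policy,
    gui_executor=lambda a: f"did {a.kind}",
    tools={"read_file": lambda path: f"contents of {path}"},
)
```

In this toy run the episode ends after one tool call and one GUI action. Counting the entries in `demo_history` gives the number of completion steps for the episode, the per-task quantity that ACS averages.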

Performance Highlights

On feasible OSWorld-MCP tasks, ToolCUA-8B achieves 46.85% overall accuracy, a 24.32% Tool Invocation Rate (TIR), and an average of 14.93 completion steps (ACS). Compared with Qwen3-VL-8B-Instruct, that is +18.62% in accuracy, +15.91% in TIR, and 4.41 fewer completion steps, so tasks are solved more often and in fewer steps.
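
As a quick sanity check on these figures, the baseline scores they imply can be recovered by subtraction. This sketch assumes the reported gains are absolute differences (percentage points for accuracy and TIR, steps for ACS); the source does not say whether they are absolute or relative, so the outputs are illustrative rather than reported results. The helper name `implied_baseline` is hypothetical.

```python
# Figures quoted for ToolCUA-8B on feasible OSWorld-MCP tasks.
reported = {"accuracy_pct": 46.85, "tir_pct": 24.32, "acs_steps": 14.93}

# Quoted differences versus Qwen3-VL-8B-Instruct.
# ASSUMPTION: these are absolute deltas, not relative percentages.
delta = {"accuracy_pct": 18.62, "tir_pct": 15.91, "acs_steps": -4.41}


def implied_baseline(reported: dict, delta: dict) -> dict:
    """Subtract each delta from the reported value to get the implied baseline."""
    return {key: round(reported[key] - delta[key], 2) for key in reported}


baseline = implied_baseline(reported, delta)
for metric, value in baseline.items():
    print(f"{metric}: {value}")
```

Under that assumption, the implied baseline is about 28.23% accuracy, 8.41% TIR, and 19.34 steps.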

Good For

  • Automating complex desktop workflows requiring both GUI interaction and tool utilization.
  • Developing agents that need to make dynamic decisions between different interaction modalities.
  • Applications demanding high accuracy and efficiency in computer-use tasks.