Mininglamp-2718/Mano-CUA-4B-Thinking-1.1

VISION | Concurrency Cost: 1 | Model Size: 4B | Quant: BF16 | Ctx Length: 32k | Tool Calling: Supported | Published: Jun 1, 2026 | License: apache-2.0 | Architecture: Transformer | Open Weights

Mano-CUA-4B-Thinking-1.1 by Mininglamp-2718 is a 4-billion-parameter GUI-VLA (Visual Language Agent) model in the Mano open-source series. Designed for edge devices, it autonomously performs complex desktop GUI operations through visual understanding. It targets tasks such as cross-system data integration, long-task planning, and intelligent report generation, using a "think-act-verify" reasoning loop.


Overview

Mano-CUA-4B-Thinking-1.1 is a 4 billion parameter GUI-VLA (Visual Language Agent) model developed by Mininglamp-2718, specifically engineered for efficient operation on edge devices like Apple Silicon Macs. It is a full-precision (fp16) version within the larger Mano open-source model series, with an MLX 8-bit quantized version also available for optimized local inference.

Key Capabilities

  • Complex GUI Automation: Executes intricate interface operations involving numerous interactive elements.
  • Cross-System Data Integration: Extracts and combines data from various sources purely through visual interaction, bypassing API dependencies.
  • Long-Task Planning: Supports enterprise-level business process automation, handling workflows with dozens to hundreds of steps.
  • Intelligent Report Generation: Automatically creates structured documents such as data analysis reports and work summaries.

Technical Approach

The model utilizes the Mano-Action bidirectional self-reinforcement learning method and a three-stage progressive training approach (SFT → Offline RL → Online RL). It incorporates a "think-act-verify" loop reasoning mechanism for high-precision GUI understanding and operation. Edge device optimization is achieved through mixed-precision quantization, visual token pruning, and adapted inference techniques.

Action Space

Mano-CUA provides a comprehensive action space for GUI interaction, including open_app, open_url, click, type, hotkey, scroll, drag, wait, finish, stop, and call_user for requesting human assistance. The model outputs structured XML with actions and coordinates normalized to a [0, 1000] range.
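Because coordinates are normalized to [0, 1000], a client has to map them to screen pixels before dispatching an action. The following is a minimal sketch of that step, not the project's own code. The XML schema is not specified here, so the sketch starts from values that have already been extracted. The helper names `to_pixels` and `is_known_action`, and the choice to clamp to the last pixel, are assumptions made for illustration.

```python
# Action names as listed in the Action Space description.
KNOWN_ACTIONS = frozenset({
    "open_app", "open_url", "click", "type", "hotkey", "scroll",
    "drag", "wait", "finish", "stop", "call_user",
})

NORM_MAX = 1000  # upper bound of the normalized coordinate range


def is_known_action(name: str) -> bool:
    """Return True if the name is one of the documented actions."""
    return name in KNOWN_ACTIONS


def to_pixels(x_norm: float, y_norm: float, width: int, height: int) -> tuple[int, int]:
    """Map normalized [0, 1000] coordinates to pixel coordinates.

    Assumption: 0 maps to the left/top edge and 1000 to the right/bottom
    edge; the result is clamped so it stays inside the screen.
    """
    if not (0 <= x_norm <= NORM_MAX and 0 <= y_norm <= NORM_MAX):
        raise ValueError("normalized coordinates must lie in [0, 1000]")
    px = min(round(x_norm * width / NORM_MAX), width - 1)
    py = min(round(y_norm * height / NORM_MAX), height - 1)
    return px, py


if __name__ == "__main__":
    # Hypothetical example: a click at the center of a 1920x1080 screen.
    if is_known_action("click"):
        print(to_pixels(500, 500, 1920, 1080))
```

Rejecting out-of-range values rather than silently clamping them makes malformed model output visible early, which matters in long multi-step workflows.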