Mininglamp-2718/Mano-CUA-4B-Thinking-1.1
Mano-CUA-4B-Thinking-1.1 by Mininglamp-2718 is a 4 billion parameter GUI-VLA (Visual Language Agent) model, part of the Mano open-source series. Designed for edge devices, it autonomously performs complex desktop GUI operations through visual understanding. It uses a "think-act-verify" reasoning loop and targets tasks such as cross-system data integration, long-task planning, and intelligent report generation.
Overview
Mano-CUA-4B-Thinking-1.1 is a 4 billion parameter GUI-VLA (Visual Language Agent) model developed by Mininglamp-2718, engineered to run efficiently on edge devices such as Apple Silicon Macs. This is the unquantized fp16 release in the Mano open-source model series; an MLX 8-bit quantized version is also available for optimized local inference.
Key Capabilities
- Complex GUI Automation: Executes intricate interface operations involving numerous interactive elements.
- Cross-System Data Integration: Extracts and combines data from various sources purely through visual interaction, bypassing API dependencies.
- Long-Task Planning: Supports enterprise-level business process automation, handling workflows with dozens to hundreds of steps.
- Intelligent Report Generation: Automatically creates structured documents such as data analysis reports and work summaries.
Technical Approach
The model is trained with the Mano-Action bidirectional self-reinforcement learning method in three progressive stages (SFT → Offline RL → Online RL). At inference time it follows a "think-act-verify" reasoning loop for high-precision GUI understanding and operation. Edge device optimization is achieved through mixed-precision quantization, visual token pruning, and adapted inference techniques.
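The "think-act-verify" loop can be pictured at the level of an agent harness. The following is a minimal sketch only, not the model's actual implementation: the function run_loop and the four callables (observe, think, act, verify) are hypothetical names introduced here for illustration, and the step budget is an arbitrary assumption.

```python
# Minimal sketch of a think-act-verify loop around a GUI agent.
# All names here are hypothetical; they are not part of any Mano API.

TERMINAL = {"finish", "stop"}  # actions from the model card that end a task


def run_loop(observe, think, act, verify, max_steps=20):
    """Run until a terminal action or until max_steps is reached.

    observe() -> current screen state (e.g. a screenshot)
    think(screen, history) -> next action name
    act(action) -> performs the action on the GUI
    verify(screen, action) -> True if the action had the intended effect
    """
    history = []  # (action, verified) pairs fed back into the next think step
    for _ in range(max_steps):
        action = think(observe(), history)
        if action in TERMINAL:
            break
        act(action)
        # Re-observe after acting so verification sees the updated screen.
        history.append((action, verify(observe(), action)))
    return history


if __name__ == "__main__":
    # Stub callables standing in for the model and the desktop.
    plan = iter(["click", "type", "finish"])
    result = run_loop(
        observe=lambda: "screen",
        think=lambda screen, history: next(plan),
        act=lambda action: None,
        verify=lambda screen, action: True,
    )
    print(result)
```

Passing the verification result back through history is one plausible way to let a failed step influence the next decision; how the model itself represents this is not specified here.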
Action Space
Mano-CUA provides a comprehensive action space for GUI interaction, including open_app, open_url, click, type, hotkey, scroll, drag, wait, finish, stop, and call_user for requesting human assistance. The model outputs structured XML with actions and coordinates normalized to a [0, 1000] range.
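Because coordinates are normalized to [0, 1000], a client has to scale them to the actual screen resolution before dispatching an action. The sketch below shows one way to do that. The action names come from the list above, but the XML layout (an action element with name, x, and y attributes) is an assumption made purely for illustration, and the real schema should be taken from the model's own documentation. The helper names to_pixels and parse_action are likewise hypothetical.

```python
import xml.etree.ElementTree as ET

# Action names as listed in the model card.
ACTIONS = {
    "open_app", "open_url", "click", "type", "hotkey", "scroll",
    "drag", "wait", "finish", "stop", "call_user",
}


def to_pixels(x_norm, y_norm, width, height, scale=1000):
    """Map coordinates in [0, scale] to integer pixel coordinates."""
    if not (0 <= x_norm <= scale and 0 <= y_norm <= scale):
        raise ValueError(f"coordinates out of range: ({x_norm}, {y_norm})")
    # Clamp so that a value of exactly `scale` lands on the last pixel.
    px = min(int(x_norm * width / scale), width - 1)
    py = min(int(y_norm * height / scale), height - 1)
    return px, py


def parse_action(xml_text, width, height):
    """Parse one action. ASSUMED layout: <action name="click" x="500" y="250"/>."""
    node = ET.fromstring(xml_text)
    name = node.get("name")
    if name not in ACTIONS:
        raise ValueError(f"unknown action: {name!r}")
    result = {"name": name}
    x, y = node.get("x"), node.get("y")
    if x is not None and y is not None:
        result["pixel"] = to_pixels(float(x), float(y), width, height)
    return result


if __name__ == "__main__":
    print(parse_action('<action name="click" x="500" y="250"/>', 1920, 1080))
```

Validating the action name against the known set before dispatch keeps a malformed model output from reaching the desktop.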