UBTECH-Robotics/Thinker-4B
UBTECH-Robotics/Thinker-4B is a 4 billion parameter vision-language foundation model developed by the UBTECH Thinker Team, engineered specifically for embodied intelligence. It addresses limitations of conventional VLMs by integrating future-state prediction, egocentric spatial intelligence, and temporal understanding. Thinker excels in task planning, visual grounding, and spatial understanding, setting new records across 7 embodied AI benchmarks.
Thinker: A Vision-Language Foundation Model for Embodied Intelligence
Thinker is a 4 billion parameter vision-language foundation model developed by the UBTECH Thinker Team, designed to bridge the gap between general scene understanding and robust, robot-centric task-level capabilities. Conventional VLMs often struggle with perspective confusion and temporal oversight; Thinker addresses these challenges by integrating future-state prediction, egocentric spatial grounding, and historical context. Its development involved high-quality dataset curation, multi-stage training, and reinforcement learning.
Key Capabilities
- Task Planning: Incorporates future-state prediction for effective decision-making.
- Spatial Intelligence: Grounded in an egocentric coordinate system for precise spatial understanding.
- Temporal Understanding: Integrates historical state information to comprehend dynamic environments.
- Visual Grounding: Achieves precise visual grounding for accurate object and scene interpretation.
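To make the egocentric grounding above concrete, here is a minimal, illustrative sketch of what expressing a world-frame object position in a robot's egocentric frame looks like. The function and frame conventions are illustrative assumptions for exposition, not part of the Thinker API:

```python
import numpy as np

def world_to_egocentric(obj_xy, robot_xy, robot_yaw):
    """Express a world-frame 2D point in the robot's egocentric frame.

    Convention (assumed for this sketch): in the egocentric frame,
    +x points along the robot's heading and +y points to its left.
    """
    dx = obj_xy[0] - robot_xy[0]
    dy = obj_xy[1] - robot_xy[1]
    c, s = np.cos(robot_yaw), np.sin(robot_yaw)
    # Rotate the world-frame offset by -yaw to align it with the robot's heading.
    return np.array([c * dx + s * dy, -s * dx + c * dy])

# Robot at (1, 1) facing +y (yaw = 90 deg); an object at (1, 3) lies 2 m straight ahead.
print(world_to_egocentric((1.0, 3.0), (1.0, 1.0), np.pi / 2))  # → [2. 0.]
```

Reasoning in this robot-centered frame (rather than a fixed world frame) is what lets an embodied model answer queries like "the cup to your left" consistently as the robot moves.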
Performance and Use Cases
Thinker demonstrates advanced capabilities across these four core dimensions, setting new records on 7 embodied AI benchmarks spanning task planning, visual grounding, and spatial understanding, and significantly outperforming open-source, closed-source, and specialized baselines. The model is particularly well-suited as a foundation for embodied intelligence and autonomous robotic decision-making, offering robust solutions for tasks that require a deep understanding of physical environments and robot-centric actions.
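If the checkpoint follows the standard Hugging Face transformers interface for vision-language chat models, inference would look roughly like the sketch below. The processor/model classes, chat template shape, and prompt are assumptions for illustration, not confirmed details of this release:

```python
# Hypothetical usage sketch, assuming a transformers-compatible VLM checkpoint.
MODEL_ID = "UBTECH-Robotics/Thinker-4B"

def build_messages(image_path, instruction):
    """Build a chat-style multimodal message list (pure data, no model required)."""
    return [{
        "role": "user",
        "content": [
            {"type": "image", "image": image_path},
            {"type": "text", "text": instruction},
        ],
    }]

RUN_MODEL = False  # set True once the checkpoint is downloaded
if RUN_MODEL:
    from transformers import AutoProcessor, AutoModelForCausalLM

    processor = AutoProcessor.from_pretrained(MODEL_ID, trust_remote_code=True)
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID, trust_remote_code=True, device_map="auto"
    )
    messages = build_messages("scene.jpg", "Plan the steps to pick up the red cup.")
    inputs = processor.apply_chat_template(
        messages, add_generation_prompt=True,
        tokenize=True, return_dict=True, return_tensors="pt",
    )
    out = model.generate(**inputs, max_new_tokens=256)
    print(processor.batch_decode(out, skip_special_tokens=True)[0])
```

Check the model card's own usage section for the exact processor classes and prompt format before running this against the real checkpoint.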