UBTECH-Robotics/Thinker-4B

Vision · Concurrency Cost: 1 · Model Size: 4B · Quant: BF16 · Ctx Length: 32k · Published: Dec 30, 2025 · License: CC BY-NC-SA 4.0 International · Architecture: Transformer

UBTECH-Robotics/Thinker-4B is a 4 billion parameter vision-language foundation model developed by the Ubtech Thinker Team, engineered specifically for embodied intelligence. It addresses limitations of conventional VLMs by integrating future-state prediction, egocentric spatial intelligence, and temporal understanding. Thinker excels at task planning, visual grounding, and spatial understanding, setting new records across 7 embodied AI benchmarks.


Thinker: A Vision-Language Foundation Model for Embodied Intelligence

Thinker is a 4 billion parameter vision-language foundation model developed by the Ubtech Thinker Team, designed to bridge the gap between general scene understanding and robust robot-centric task-level capabilities. Unlike conventional VLMs that often struggle with perspective confusion and temporal oversight, Thinker integrates advanced mechanisms to handle these challenges. Its development involved high-quality dataset curation, multi-stage training, and reinforcement learning.

Key Capabilities

  • Task Planning: Incorporates future-state prediction for effective decision-making.
  • Spatial Intelligence: Grounded in an egocentric coordinate system for precise spatial understanding.
  • Temporal Understanding: Integrates historical state information to comprehend dynamic environments.
  • Visual Grounding: Achieves precise visual grounding for accurate object and scene interpretation.
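As a rough illustration of how such a model might be queried, the sketch below assembles a single-turn vision-language chat message, e.g. an egocentric spatial question about a scene image. The message schema and the commented loading code follow common Hugging Face VLM conventions; the exact classes, chat format, and file names are assumptions, not taken from this card, so verify them against the model's own documentation.

```python
# Minimal usage sketch for a vision-language model such as Thinker-4B.
# The message schema below is an assumed convention (common to many
# Hugging Face VLMs), not confirmed by this model card.

def build_messages(image_path: str, question: str) -> list[dict]:
    """Assemble a single-turn chat message pairing one image
    with one text question (assumed chat format)."""
    return [
        {
            "role": "user",
            "content": [
                {"type": "image", "image": image_path},
                {"type": "text", "text": question},
            ],
        }
    ]

# Hypothetical end-to-end call (requires downloading the checkpoint;
# class names are assumptions based on typical VLM integrations):
# from transformers import AutoProcessor, AutoModelForVision2Seq
# processor = AutoProcessor.from_pretrained("UBTECH-Robotics/Thinker-4B")
# model = AutoModelForVision2Seq.from_pretrained(
#     "UBTECH-Robotics/Thinker-4B", torch_dtype="bfloat16")
# inputs = processor.apply_chat_template(
#     build_messages("scene.jpg", "Which object is left of the robot arm?"),
#     tokenize=True, return_dict=True, return_tensors="pt")
# output_ids = model.generate(**inputs, max_new_tokens=128)
# print(processor.decode(output_ids[0], skip_special_tokens=True))
```

For an egocentric spatial query, the question text would be phrased from the robot's own viewpoint, matching the model's egocentric coordinate framing described above.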

Performance and Use Cases

Thinker has demonstrated advanced capabilities across these four core dimensions, setting new records on 7 embodied AI benchmarks spanning Task Planning, Visual Grounding, and Spatial Understanding, and significantly outperforming existing open-source, closed-source, and specialized baselines. The model is particularly well suited as a foundation for embodied intelligence and autonomous robotic decision-making, offering robust solutions for tasks that require a deep understanding of physical environments and robot-centric actions.