BAAI-Agents/EgoActor-8b-Qwen3VL

VISIONConcurrency Cost:1Model Size:8BQuant:FP8Ctx Length:32kPublished:Oct 22, 2025License:apache-2.0Architecture:Transformer0.0K Open Weights Cold

EgoActor-8b-Qwen3VL by BAAI-Agents is an 8 billion parameter unified vision-language model (VLM) based on Qwen3-VL, designed to translate natural language instructions into precise spatial and temporal action sequences for humanoid robots. It specializes in grounding instructions into egocentric, spatial-aware motor behaviors, bridging abstract task planning and concrete embodied control. This model excels at tasks requiring instruction-to-action grounding, such as mobile manipulation and navigation, using first-person camera input.

Loading preview...

EgoActor-8b-Qwen3VL: Vision-Language Model for Humanoid Robot Control

EgoActor-8b-Qwen3VL, developed by the BAAI-Agents team, is an 8 billion parameter unified vision-language model (VLM) built upon the Qwen3-VL architecture. Its core function is to convert natural language instructions into specific spatial and temporal action sequences for humanoid robots. The model integrates perception, planning, and action execution by grounding instructions into egocentric, spatial-aware motor behaviors, encompassing movement, manipulation, perception, and human interaction.

Key Capabilities

  • Instruction-to-Action Grounding: Translates high-level natural language commands into executable robot actions.
  • Egocentric Vision Processing: Specialized in analyzing first-person view images from embodied robots to inform decision-making.
  • Spatial-Aware Motor Behaviors: Generates precise motor commands for tasks like navigation, object manipulation, and interaction.
  • Multi-Modal Input: Processes mixed text and image content, including historical and recent observation frames, to predict action sequences.

Good For

  • Robotics and Embodied AI Research: Ideal for scenarios requiring instruction-to-action grounding for humanoid robots.
  • Mobile Manipulation Tasks: Suitable for tasks such as approaching and picking up objects based on first-person camera input.
  • Simulation and Real-World Robot Testing: Supports testing in environments where models interact with egocentric vision and spatial reasoning.

Limitations

  • Performance is highly dependent on egocentric RGB inputs; degradation may occur with poor sensor data.
  • Generalization to drastically different robot hardware or unstructured environments may require fine-tuning.
  • Not intended for general LLM capabilities or non-embodied tasks.
  • Physical robot deployment requires significant safety considerations due to potential collision hazards and unexpected movements.