BAAI-Agents/EgoActor-4b-Qwen3VL
EgoActor-4b-Qwen3VL is a 4 billion parameter unified vision-language model developed by the BAAI-Agents team, fine-tuned from Qwen3-VL. This model specializes in translating natural language instructions into precise spatial and temporal action sequences for humanoid robots, leveraging egocentric visual perception. It is designed for embodied AI research, enabling robots to perform tasks like object manipulation and navigation based on first-person camera input.
Loading preview...
EgoActor-4b-Qwen3VL: Vision-Language Model for Humanoid Robotics
EgoActor-4b-Qwen3VL is a 4 billion parameter vision-language model (VLM) developed by the BAAI-Agents team, building upon the Qwen3-VL architecture. Its core function is to bridge the gap between natural language instructions and concrete robotic actions, specifically for humanoid robots. The model processes egocentric visual input to generate precise spatial and temporal action sequences, integrating perception, planning, and execution.
Key Capabilities
- Instruction-to-Action Grounding: Translates high-level natural language commands into executable motor behaviors for humanoid robots.
- Egocentric Vision Integration: Utilizes first-person camera inputs for spatial reasoning and action generation.
- Unified Perception and Planning: Combines visual perception with task planning to control robot movement, manipulation, and interaction.
- Multi-Modal Input: Supports multi-image vision-language inputs for embodied action prediction, including historical and recent observation frames.
Good for
- Robotics Research: Ideal for researchers in embodied AI focusing on instruction-to-action grounding for humanoid robots.
- Mobile Manipulation Tasks: Suitable for tasks requiring robots to approach, pick up objects, or navigate based on natural language prompts.
- Simulation and Real-World Testing: Designed for use in both simulated and physical robot environments for mobile manipulation.
Limitations and Considerations
- Egocentric Vision Dependence: Performance relies heavily on the quality of egocentric RGB inputs.
- Generalization: May require fine-tuning for drastically different robot hardware or highly unstructured environments.
- Safety Risks: Use in physical robots necessitates appropriate safety controls due to potential for unexpected movements.
- Out-of-Scope: Not intended for general LLM capabilities, natural language dialogue, or high-speed low-level control tasks.