IffYuan/Embodied-R1.5
IffYuan/Embodied-R1.5 is an 8-billion-parameter Embodied Foundation Model (EFM) built on Qwen3-VL-8B-Instruct and designed for comprehensive embodied reasoning. It unifies spatial cognition, task planning, and embodied pointing, which enables a Planner-Grounder-Corrector (PGC) closed-loop framework for autonomous long-horizon real-world tasks. Trained on a 15B-token corpus, it performs strongly across 24 embodied VLM benchmarks and generalizes zero-shot to real robots for instruction following and manipulation.
Embodied-R1.5: Unified Embodied Foundation Model
Embodied-R1.5, developed by IffYuan, is an 8-billion-parameter Embodied Foundation Model (EFM) based on Qwen3-VL-8B-Instruct. It integrates comprehensive embodied reasoning within a single architecture, moving beyond specialized pointing to unify three core capabilities: spatial cognition and reasoning, task planning and correction, and embodied pointing and localization.
Key Capabilities & Features
- Comprehensive Embodied Reasoning: Understands physical world semantics, geometric relations, and dynamic interaction possibilities.
- Full Task Life Cycle Management: Handles long-horizon task decomposition, next-step planning, process detection, error localization, and correction.
- Grounding High-Level Reasoning: Translates reasoning into coordinates and trajectories, including referring expression grounding, region-level localization, functional grounding, and visual trace generation.
- Planner-Grounder-Corrector (PGC) Framework: Operates as a closed-loop system where the model acts as planner, grounder, and corrector for autonomous task completion.
- Strong Benchmark Performance: Achieves an average score of 70.4% across 24 embodied VLM benchmarks, outperforming models such as Gemini-Robotics-ER-1.5.
- Real-World Generalization: Demonstrates zero-shot generalization to real robots for instruction following, affordance grounding, and articulated manipulation.
- Structured Output: Follows the Qwen3-VL chat format, outputting structured answers within <answer>...</answer> tags for various task types, including spatial grounding, point, and trace data.
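
Because answers arrive wrapped in <answer>...</answer> tags, downstream code usually has to pull the tagged span out of the raw generation before using it. The following is a minimal sketch of that step, not the project's own parser. The helper names `extract_answer` and `parse_points` are hypothetical, and the point syntax (number pairs written as `[x, y]` or `(x, y)`) is an assumption; the actual payload format for each task type should be checked against the model's documentation.

```python
import re
from typing import List, Optional, Tuple

# Matches the payload between <answer> and </answer>, across newlines.
ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)

# ASSUMPTION: points are written as "[x, y]" or "(x, y)" number pairs.
# This syntax is illustrative and is not confirmed by the model card.
POINT_RE = re.compile(
    r"[\(\[]\s*(-?\d+(?:\.\d+)?)\s*,\s*(-?\d+(?:\.\d+)?)\s*[\)\]]"
)


def extract_answer(text: str) -> Optional[str]:
    """Return the content of the last <answer>...</answer> pair, or None."""
    matches = ANSWER_RE.findall(text)
    return matches[-1].strip() if matches else None


def parse_points(answer: str) -> List[Tuple[float, float]]:
    """Collect every (x, y) number pair found in an answer payload."""
    return [(float(x), float(y)) for x, y in POINT_RE.findall(answer)]


if __name__ == "__main__":
    # Hypothetical model output for a trace-style query.
    raw = "The cup is left of the plate. <answer>[[120, 340], [200, 360]]</answer>"
    answer = extract_answer(raw)
    print(answer)
    if answer is not None:
        print(parse_points(answer))
```

Taking the last tagged span is a design choice that tolerates reasoning text which itself mentions the tag; a stricter caller might instead reject outputs containing more than one pair.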
Use Cases
- Robotics: Ideal for robotic manipulation, instruction following, and autonomous task execution in real-world environments.
- Embodied AI Research: Provides a unified model for exploring and advancing embodied reasoning and physical intelligence.
- Vision-Language Tasks: Excels in tasks requiring deep understanding of visual scenes combined with linguistic instructions for physical interaction.