nvidia/Cosmos-Reason1-7B
NVIDIA Cosmos-Reason1-7B is a 7-billion-parameter multimodal vision language model (VLM) developed by NVIDIA for physical AI and robotics. Built on Qwen2.5-VL-7B-Instruct, it specializes in embodied reasoning: understanding space, time, and fundamental physics from video and image inputs. The model is post-trained on physical common sense data with supervised fine-tuning and reinforcement learning, enabling it to act as a planning model for embodied agents and to support robot planning, data curation, and video analytics.
NVIDIA Cosmos-Reason1-7B: Embodied Reasoning for Physical AI
NVIDIA Cosmos-Reason1-7B is a 7-billion-parameter multimodal vision language model (VLM) developed by NVIDIA, engineered specifically for physical AI and robotics applications. Built on the Qwen2.5-VL-7B-Instruct architecture, it is distinguished by human-like reasoning: it combines prior knowledge, an understanding of physics, and common sense to interpret and act within the real world.
Key Capabilities
- Physical Common Sense and Embodied Reasoning: Excels at understanding space, time, and fundamental physics, enabling robots and AI agents to navigate diverse physical scenarios.
- Video and Image Input: Processes video and image data alongside text prompts, converting visual inputs into tokens via a vision encoder and projector before feeding them into the LLM core.
- Chain-of-Thought Reasoning: Uses step-by-step reasoning to produce detailed, logical responses, learning world dynamics without requiring human annotations.
- Post-Training: Enhanced through supervised fine-tuning and reinforcement learning on physical common sense and embodied reasoning datasets.
- Commercial Use: Released under the NVIDIA Open Model License, making it suitable for commercial applications.
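The video-plus-text input and chain-of-thought prompting described above can be sketched as a message-building helper. This is a minimal sketch only: the reasoning system prompt and the `<think>`/`<answer>` tags are assumptions modeled on common Qwen2.5-VL-style conventions, not the model's official prompt, and `build_messages` is a hypothetical helper name.

```python
# Hedged sketch: assembling a video + text chat request in the
# Qwen2.5-VL-style message format that Cosmos-Reason1-7B inherits.
# The system prompt below is an illustrative assumption, not the
# official prompt from the model card.
SYSTEM_PROMPT = (
    "You are a helpful assistant. Answer the question in the following "
    "format: <think>your reasoning</think> <answer>your answer</answer>."
)

def build_messages(video_path: str, question: str, fps: int = 4) -> list[dict]:
    """Build an OpenAI-style message list with a video content part.

    The ``fps`` field follows the model card's recommendation of
    sampling video at 4 frames per second.
    """
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {
            "role": "user",
            "content": [
                # Video part first, then the text question, mirroring
                # how vision-language chat templates typically interleave.
                {"type": "video", "video": video_path, "fps": fps},
                {"type": "text", "text": question},
            ],
        },
    ]

messages = build_messages("warehouse_clip.mp4", "Is it safe for the robot to proceed?")
```

In practice such a message list would be handed to a chat template (for example via a Hugging Face processor) or to a vLLM-style chat endpoint; the exact loading and serving code depends on your stack and is not shown here.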
Use Cases
- Robot Planning and Reasoning: Serves as a "brain" for methodical decision-making in robot vision language action (VLA) models, allowing robots to interpret environments and execute complex tasks.
- Data Curation and Annotation: Automates high-quality curation and annotation of large, diverse training datasets for physical AI.
- Video Analytics AI Agents: Extracts insights and performs root-cause analysis from massive volumes of video data in city and industrial operations.
Technical Details
The model comprises a Vision Transformer (ViT) with 675.76M parameters and a Language Model (LLM) with 7.07B parameters, along with smaller connecting components. It supports a context length of 32,768 tokens and is optimized for NVIDIA GPU-accelerated systems. For inference, fps=4 is recommended for video input and max_tokens=4096 for output, so that chain-of-thought responses are not truncated. On embodied reasoning benchmarks, the model averages 65.1% accuracy across RoboVQA, AV, BridgeDataV2, Agibot, HoloAssist, and RoboFail.
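The fps=4 and max_tokens=4096 recommendations imply a rough input budget within the 32,768-token context window. The arithmetic below illustrates it under one loud assumption: the ~256 tokens-per-frame figure is hypothetical, since the real per-frame token count depends on resolution and the processor's patching scheme.

```python
# Rough token-budget check for video input.
CONTEXT_LENGTH = 32768      # model context window (from the model card)
MAX_OUTPUT_TOKENS = 4096    # recommended max_tokens for chain-of-thought output
TOKENS_PER_FRAME = 256      # assumption for illustration only

def max_clip_seconds(fps: int = 4) -> float:
    """Longest clip that fits the input budget at the given sampling rate."""
    input_budget = CONTEXT_LENGTH - MAX_OUTPUT_TOKENS   # tokens left for input
    max_frames = input_budget // TOKENS_PER_FRAME       # frames that fit
    return max_frames / fps                             # seconds of video

# Under these assumptions, 28672 input tokens fit 112 frames, i.e. ~28 s
# of video at the recommended fps=4.
print(max_clip_seconds())
```

Longer clips would need a lower sampling rate or frame subsampling; this is only a back-of-the-envelope guide, not a documented limit.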