nvidia/Cosmos-Reason1-7B

Vision · Open Weights · Concurrency Cost: 1 · Model Size: 7B · Quantization: FP8 · Context Length: 32k · Published: Apr 18, 2025 · License: NVIDIA Open Model License · Architecture: Transformer

NVIDIA Cosmos-Reason1-7B is a 7 billion parameter multimodal vision language model (VLM) developed by NVIDIA, designed for physical AI and robotics. Built on Qwen2.5-VL-7B-Instruct, it specializes in embodied reasoning, understanding space, time, and fundamental physics from video and image inputs. The model is post-trained with physical common sense data using supervised fine-tuning and reinforcement learning, enabling it to act as a planning model for embodied agents and excel in robot planning, data curation, and video analytics.


NVIDIA Cosmos-Reason1-7B: Embodied Reasoning for Physical AI

NVIDIA Cosmos-Reason1-7B is a 7 billion parameter multimodal vision language model (VLM) developed by NVIDIA, specifically engineered for physical AI and robotics applications. Built on the Qwen2.5-VL-7B-Instruct architecture, it is distinguished by its ability to reason the way humans do, drawing on prior knowledge, physics understanding, and common sense to interpret and act within the real world.

Key Capabilities

  • Physical Common Sense and Embodied Reasoning: Excels at understanding space, time, and fundamental physics, enabling robots and AI agents to navigate diverse physical scenarios.
  • Video and Image Input: Processes video and image data alongside text prompts, converting visual inputs into tokens via a vision encoder and projector before feeding them into the LLM core (see the inference sketch after this list).
  • Chain-of-Thought Reasoning: Uses step-by-step reasoning to produce detailed, logical responses, learning world dynamics without relying on human annotations.
  • Post-Training: Enhanced through supervised fine-tuning and reinforcement learning on physical common sense and embodied reasoning datasets.
  • Commercial Use: Released under the NVIDIA Open Model License, making it suitable for commercial applications.
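
The video input path above can be made concrete with a short inference sketch. This is a minimal, hedged example using Hugging Face transformers: since the model is built on Qwen2.5-VL-7B-Instruct, it is loaded here through the Qwen2.5-VL classes, and the video path and prompt are placeholders; check the official model card for the exact recommended loading code.

```python
# Minimal inference sketch (assumes the checkpoint loads via the
# Qwen2.5-VL classes, following its Qwen2.5-VL-7B-Instruct base).
import torch
from qwen_vl_utils import process_vision_info
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

MODEL_ID = "nvidia/Cosmos-Reason1-7B"

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(MODEL_ID)

# Video frames are sampled at the recommended fps=4 before the vision
# encoder and projector turn them into tokens for the LLM core.
messages = [{
    "role": "user",
    "content": [
        {"type": "video", "video": "file:///path/to/robot_clip.mp4", "fps": 4},
        {"type": "text",
         "text": "What should the robot do next, and why? Reason step by step."},
    ],
}]

text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text], images=image_inputs, videos=video_inputs,
    padding=True, return_tensors="pt",
).to(model.device)

# max_new_tokens=4096 leaves room for a full chain-of-thought answer.
output_ids = model.generate(**inputs, max_new_tokens=4096)
answer = processor.batch_decode(
    output_ids[:, inputs.input_ids.shape[1]:], skip_special_tokens=True
)[0]
print(answer)
```

Sampling at fps=4 and allowing up to 4096 new tokens matches the recommendations in the Technical Details section below.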

Use Cases

  • Robot Planning and Reasoning: Serves as a "brain" for methodical decision-making in robot vision-language-action (VLA) models, allowing robots to interpret environments and execute complex tasks.
  • Data Curation and Annotation: Automates high-quality curation and annotation of large, diverse training datasets for physical AI.
  • Video Analytics AI Agents: Extracts insights and performs root-cause analysis from massive volumes of video data in city and industrial operations.

Technical Details

The model pairs a Vision Transformer (ViT) with 675.76M parameters and a Language Model (LLM) with 7.07B parameters, connected by the vision projector described above. It supports a context length of 32,768 tokens and is optimized for NVIDIA GPU-accelerated systems. For inference, NVIDIA recommends sampling video at fps=4 and setting max_tokens=4096 so that chain-of-thought responses are not truncated. On embodied reasoning benchmarks, the model averages 65.1% accuracy across RoboVQA, AV, BridgeData V2, Agibot, HoloAssist, and RoboFail.
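
To apply these recommended settings in a deployment, the sketch below serves the checkpoint with vLLM's OpenAI-compatible server and queries it with the openai client. The host, port, video URL, and sampling values are placeholder assumptions, and video content parts depend on the vLLM version's multimodal support.

```python
# First launch the server (one assumed invocation):
#   vllm serve nvidia/Cosmos-Reason1-7B --max-model-len 32768
from openai import OpenAI

# vLLM's OpenAI-compatible endpoint; host, port, and key are placeholders.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="nvidia/Cosmos-Reason1-7B",
    max_tokens=4096,  # room for a full chain-of-thought response
    messages=[{
        "role": "user",
        "content": [
            # "video_url" content parts are a vLLM extension for video-capable
            # models; availability depends on the vLLM version.
            {"type": "video_url",
             "video_url": {"url": "https://example.com/warehouse_cam.mp4"}},
            {"type": "text",
             "text": "Summarize the incident and propose a likely root cause."},
        ],
    }],
)
print(resp.choices[0].message.content)
```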