nvidia/Cosmos-Reason1-7B
NVIDIA Cosmos-Reason1-7B is a 7-billion-parameter multimodal vision language model (VLM) developed by NVIDIA for physical AI and robotics. Built on Qwen2.5-VL-7B-Instruct, it specializes in embodied reasoning: understanding space, time, and fundamental physics from video and image inputs. The model is post-trained on physical common sense data with supervised fine-tuning and reinforcement learning, enabling it to act as a planning model for embodied agents and to support robot planning, data curation, and video analytics.
NVIDIA Cosmos-Reason1-7B: Embodied Reasoning for Physical AI
NVIDIA Cosmos-Reason1-7B is a 7-billion-parameter multimodal vision language model (VLM) developed by NVIDIA, engineered specifically for physical AI and robotics applications. Built on the Qwen2.5-VL-7B-Instruct architecture, it is distinguished by human-like reasoning: it combines prior knowledge, an understanding of physics, and common sense to interpret and act within the real world.
Key Capabilities
- Physical Common Sense and Embodied Reasoning: Excels at understanding space, time, and fundamental physics, enabling robots and AI agents to navigate diverse physical scenarios.
- Video and Image Input: Processes video and image data alongside text prompts, converting visual inputs into tokens via a vision encoder and projector before feeding them into the LLM core.
- Chain-of-Thought Reasoning: Uses step-by-step reasoning to produce detailed, logical responses, learning world dynamics without requiring human annotations.
- Post-Training: Enhanced through supervised fine-tuning and reinforcement learning on physical common sense and embodied reasoning datasets.
- Commercial Use: Released under the NVIDIA Open Model License, making it suitable for commercial applications.
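The video-plus-text input and chain-of-thought prompting described above can be sketched as a message-building helper. This is a minimal sketch only: the reasoning system prompt and the `<think>`/`<answer>` tags are assumptions modeled on common Qwen2.5-VL-style conventions, not the model's official prompt, and `build_messages` is a hypothetical helper name.

```python
# Hedged sketch: assembling a video + text chat request in the
# Qwen2.5-VL-style message format that Cosmos-Reason1-7B inherits.
# The system prompt below is an illustrative assumption, not the
# official prompt from the model card.
SYSTEM_PROMPT = (
    "You are a helpful assistant. Answer the question in the following "
    "format: <think>your reasoning</think> <answer>your answer</answer>."
)

def build_messages(video_path: str, question: str, fps: int = 4) -> list[dict]:
    """Build an OpenAI-style message list with a video content part.

    The ``fps`` field follows the model card's recommendation of
    sampling video at 4 frames per second.
    """
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {
            "role": "user",
            "content": [
                # Video part first, then the text question, mirroring
                # how vision-language chat templates typically interleave.
                {"type": "video", "video": video_path, "fps": fps},
                {"type": "text", "text": question},
            ],
        },
    ]

messages = build_messages("warehouse_clip.mp4", "Is it safe for the robot to proceed?")
```

In practice such a message list would be handed to a chat template (for example via a Hugging Face processor) or to a vLLM-style chat endpoint; the exact loading and serving code depends on your stack and is not shown here.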
Use Cases
- Robot Planning and Reasoning: Serves as a "brain" for methodical decision-making in robot vision language action (VLA) models, allowing robots to interpret environments and execute complex tasks.
- Data Curation and Annotation: Automates high-quality curation and annotation of large, diverse training datasets for physical AI.
- Video Analytics AI Agents: Extracts insights and performs root-cause analysis from massive volumes of video data in city and industrial operations.
Technical Details
The model comprises a Vision Transformer (ViT) with 675.76M parameters and a Language Model (LLM) with 7.07B parameters, along with smaller connecting components. It supports a context length of 32,768 tokens and is optimized for NVIDIA GPU-accelerated systems. For inference, fps=4 is recommended for video input and max_tokens=4096 for output, so that chain-of-thought responses are not truncated. On embodied reasoning benchmarks, the model averages 65.1% accuracy across RoboVQA, AV, BridgeDataV2, Agibot, HoloAssist, and RoboFail.
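The fps=4 and max_tokens=4096 recommendations imply a rough input budget within the 32,768-token context window. The arithmetic below illustrates it under one loud assumption: the ~256 tokens-per-frame figure is hypothetical, since the real per-frame token count depends on resolution and the processor's patching scheme.

```python
# Rough token-budget check for video input.
CONTEXT_LENGTH = 32768      # model context window (from the model card)
MAX_OUTPUT_TOKENS = 4096    # recommended max_tokens for chain-of-thought output
TOKENS_PER_FRAME = 256      # assumption for illustration only

def max_clip_seconds(fps: int = 4) -> float:
    """Longest clip that fits the input budget at the given sampling rate."""
    input_budget = CONTEXT_LENGTH - MAX_OUTPUT_TOKENS   # tokens left for input
    max_frames = input_budget // TOKENS_PER_FRAME       # frames that fit
    return max_frames / fps                             # seconds of video

# Under these assumptions, 28672 input tokens fit 112 frames, i.e. ~28 s
# of video at the recommended fps=4.
print(max_clip_seconds())
```

Longer clips would need a lower sampling rate or frame subsampling; this is only a back-of-the-envelope guide, not a documented limit.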