nvidia/Cosmos-Reason2-32B
NVIDIA Cosmos-Reason2-32B is a 32 billion parameter vision language model (VLM) developed by NVIDIA, designed for physical AI and robotics. It excels at embodied reasoning, understanding space, time, and fundamental physics to enable agents to interpret environments and plan actions. This model supports multimodal inputs (text, video, image) and features enhanced spatio-temporal understanding, object detection with 2D/3D localization, and improved long-context understanding up to 256K tokens.
Loading preview...
NVIDIA Cosmos-Reason2-32B: Physical AI and Embodied Reasoning VLM
NVIDIA Cosmos-Reason2-32B is a 32 billion parameter Vision Language Model (VLM) specifically engineered for physical AI and robotics applications. Developed by NVIDIA, this model is built upon the Qwen3-VL-32B-Instruct architecture and is designed to enable robots and AI agents to reason about the physical world with human-like common sense, incorporating prior knowledge and physics understanding.
Key Capabilities and Features
- Enhanced Physical AI Reasoning: Features improved spatio-temporal understanding and timestamp precision, crucial for dynamic environments.
- Multimodal Input Support: Processes text, video (MP4), and image (JPG) inputs, with a recommended
FPS=4for video to match training. - Object Detection: Supports object detection with 2D/3D point localization and bounding box coordinates, accompanied by reasoning explanations.
- Long-Context Understanding: Offers improved long-context processing, supporting up to 256K input tokens.
- Commercial Use: The model is released under the NVIDIA Open Model License and is ready for commercial deployment.
Performance and Benchmarks
Cosmos-Reason2-32B demonstrates strong performance across various physical AI benchmarks, often outperforming Qwen3-VL-32B-Instruct in categories such as General (75.85% overall), Robotics (60.60% overall), Self-Driving (70.15% overall), and Smart Spaces (77.79% overall). It shows particular strength in tasks requiring physical common sense and embodied reasoning.
Ideal Use Cases
- Robot Planning and Reasoning: Acts as a core component for deliberate decision-making in robot Vision-Language-Action (VLA) models, enabling robots to interpret complex commands and execute tasks with common sense.
- Video Analytics AI Agents: Extracts insights and performs root-cause analysis from video data, suitable for city and industrial operations.
- Data Curation and Annotation: Automates high-quality curation and annotation of large, diverse training datasets for physical AI development.