nvidia/EGM-8B

Vision | Concurrency Cost: 1 | Model Size: 8B | Quant: FP8 | Ctx Length: 32k | Published: Mar 3, 2026 | License: apache-2.0 | Architecture: Transformer | Open Weights

The nvidia/EGM-8B model is an 8 billion parameter vision-language model developed by NVIDIA, built upon Qwen3-VL-8B-Thinking. It is optimized for efficient visual grounding, reaching higher accuracy and faster inference than much larger models. EGM-8B uses increased test-time computation to overcome the text understanding limitations of smaller VLMs, which makes it well suited to applications that require precise object localization and description.


EGM-Qwen3-VL-8B: Efficient Visual Grounding Language Model

nvidia/EGM-8B is an 8 billion parameter vision-language model (VLM) developed by NVIDIA, designed for efficient and accurate visual grounding. It is built on Qwen3-VL-8B-Thinking and trained with a two-stage pipeline: supervised fine-tuning (SFT) and reinforcement learning (RL) with Group Relative Policy Optimization (GRPO).
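The core idea of GRPO can be illustrated with a small sketch: several responses are sampled for the same prompt, and each response's reward is normalized against the mean and standard deviation of its own group. The function name, the `eps` parameter, and the reward values below are hypothetical and chosen only for illustration. This is not NVIDIA's training code, and the reward actually used to train EGM-8B is not described here.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against the other samples in its group."""
    mu = mean(rewards)
    # Population standard deviation; a sample standard deviation is an
    # equally plausible choice and is an assumption either way.
    sigma = pstdev(rewards)
    # eps avoids division by zero when all rewards in the group are equal.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four responses sampled for the same prompt.
rewards = [0.9, 0.5, 0.7, 0.3]
advantages = group_relative_advantages(rewards)
print(advantages)
```

Responses that score above their group's mean receive a positive advantage and those below receive a negative one, so no separate value model is needed to supply a baseline.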

Key Capabilities & Performance

  • Superior Visual Grounding: Achieves an average IoU of 91.4 on the RefCOCO benchmark, 3.6 points higher than its base model.
  • Outperforms Larger Models: Scores higher on visual grounding than the much larger Qwen3-VL-235B-A22B-Instruct (88.2 avg IoU) and Qwen3-VL-235B-A22B-Thinking (90.7 avg IoU).
  • Exceptional Inference Speed: Runs inference 5.9x faster than Qwen3-VL-235B and 18.9x faster than Qwen3-VL-235B-Thinking, which suits latency-sensitive applications.
  • Mitigates Text Understanding Gap: Small VLMs fall behind mainly when handling complex prompts. EGM-8B compensates by generating more mid-quality tokens at test time to match the performance of larger models.
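For readers unfamiliar with the metric, intersection over union (IoU) between a predicted and a ground-truth box can be computed as in the sketch below. It assumes boxes are given as `(x1, y1, x2, y2)` corner coordinates, which is an assumption about the format rather than something this page specifies, and the helper names are hypothetical. The function returns values in [0, 1], whereas the figures above are reported on a 0 to 100 scale.

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    # Width and height of the overlap, clamped to zero for disjoint boxes.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def mean_iou(pairs):
    """Average IoU over (predicted_box, ground_truth_box) pairs."""
    return sum(box_iou(pred, gt) for pred, gt in pairs) / len(pairs)


# Hypothetical predictions paired with ground-truth boxes.
pairs = [
    ((0, 0, 2, 2), (0, 0, 2, 2)),  # perfect match
    ((0, 0, 2, 2), (1, 1, 3, 3)),  # partial overlap
]
print(mean_iou(pairs))
```

The second pair overlaps in a 1 by 1 square out of a combined area of 7, so its IoU is 1/7.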

Ideal Use Cases

  • Precise Object Localization: Applications requiring accurate bounding box predictions based on natural language descriptions.
  • Efficient VLM Deployment: Scenarios where high performance on visual grounding is needed without the computational overhead of very large models.
  • Research in Visual Grounding: A strong baseline for exploring efficient VLM architectures and training methodologies.