nvidia/EGM-8B
nvidia/EGM-8B is an 8-billion-parameter vision-language model developed by NVIDIA and built on Qwen3-VL-8B-Thinking. It is optimized for efficient visual grounding, reporting higher accuracy and substantially faster inference than much larger models. EGM-8B uses increased test-time computation to compensate for the weaker text understanding of smaller VLMs, which makes it well suited to applications that require precise object localization and description.
EGM-Qwen3-VL-8B: Efficient Visual Grounding Language Model
nvidia/EGM-8B is an 8-billion-parameter vision-language model (VLM) developed by NVIDIA for efficient and accurate visual grounding. Built on the Qwen3-VL-8B-Thinking architecture, EGM-8B is trained with a two-stage pipeline of supervised fine-tuning (SFT) and reinforcement learning (RL) with Group Relative Policy Optimization (GRPO).
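To illustrate the idea behind GRPO, the sketch below computes group-relative advantages: several responses are sampled for the same prompt, and each response's reward is normalized against the mean and standard deviation of its own group. This is a minimal illustration, not NVIDIA's training code. The function name, the example rewards, the `eps` constant, and the use of the population standard deviation are all assumptions made for the example.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against its own sampling group.

    Hypothetical helper: `eps` avoids division by zero when every
    response in the group receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; implementations may differ
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four responses sampled for one prompt
rewards = [0.9, 0.5, 0.7, 0.3]
advantages = group_relative_advantages(rewards)
```

Responses that score above the group mean get a positive advantage and are reinforced, while those below it get a negative one. No separate value model is needed, because the group itself serves as the baseline.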
Key Capabilities & Performance
- Superior Visual Grounding: Achieves an average IoU of 91.4 on the RefCOCO benchmark, a 3.6 IoU improvement over its base model.
- Outperforms Larger Models: Surpasses the performance of significantly larger models like Qwen3-VL-235B-A22B-Instruct (88.2 avg IoU) and Qwen3-VL-235B-A22B-Thinking (90.7 avg IoU) in visual grounding tasks.
- Exceptional Inference Speed: Offers 5.9x faster inference than Qwen3-VL-235B and 18.9x faster than Qwen3-VL-235B-Thinking, making it highly efficient for real-time applications.
- Mitigates Text Understanding Gap: Addresses the primary limitation of small VLMs, their weaker handling of complex prompts, by spending more test-time computation: the model generates a larger number of mid-quality tokens to match the performance of larger models.
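For readers unfamiliar with the metric quoted above, intersection over union (IoU) measures how well a predicted bounding box overlaps a reference box. The sketch below shows the standard computation for two axis-aligned boxes. The `(x1, y1, x2, y2)` pixel format and the example coordinates are assumptions for illustration. The model card does not specify the model's output format or how the benchmark averages were aggregated.

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    # Corners of the intersection rectangle
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    # Clamp to zero when the boxes do not overlap
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


# Hypothetical predicted and reference boxes, in pixels
predicted = (10, 10, 110, 110)
ground_truth = (20, 20, 120, 120)
iou = box_iou(predicted, ground_truth)
```

An IoU of 1.0 means the boxes coincide exactly, and 0.0 means they do not overlap at all.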
Ideal Use Cases
- Precise Object Localization: Applications requiring accurate bounding box predictions based on natural language descriptions.
- Efficient VLM Deployment: Scenarios where high performance on visual grounding is needed without the computational overhead of very large models.
- Research in Visual Grounding: A strong baseline for exploring efficient VLM architectures and training methodologies.