Name: nvidia/EGM-8B API
Brand: Featherless.ai
Price: 10.00 USD
Availability: InStock
Author: nvidia

EGM-Qwen3-VL-8B: Efficient Visual Grounding Language Model

nvidia/EGM-8B is an 8-billion-parameter vision-language model (VLM) developed by NVIDIA for efficient and accurate visual grounding. Built on the Qwen3-VL-8B-Thinking architecture, EGM-8B is trained with a two-stage pipeline: supervised fine-tuning (SFT) followed by reinforcement learning (RL) with Group Relative Policy Optimization (GRPO).
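
To make the GRPO step more concrete, the following is a minimal sketch of the group-relative advantage that gives the method its name: several responses are sampled for the same prompt, each is scored, and every score is normalized against the mean and standard deviation of its own group. This is not EGM's training code. The function name, the `eps` constant, and the idea of using an IoU-like score as the reward are assumptions made for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against its own sampling group.

    rewards: scores for several responses sampled from the same prompt
             (hypothetical example: one IoU-style score per predicted box).
    eps:     small constant (assumed value) that avoids division by zero
             when every response in the group gets the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled responses for one prompt, two good and two poor.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Responses that score above their group's mean receive a positive advantage and are reinforced; those below the mean receive a negative one. Because the baseline comes from the group itself, no separate value model is needed.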

Key Capabilities & Performance

Superior Visual Grounding: Achieves an average IoU of 91.4 on the RefCOCO benchmark, 3.6 points higher than its base model.
Outperforms Larger Models: Surpasses the performance of significantly larger models like Qwen3-VL-235B-A22B-Instruct (88.2 avg IoU) and Qwen3-VL-235B-A22B-Thinking (90.7 avg IoU) in visual grounding tasks.
Exceptional Inference Speed: Offers 5.9x faster inference than Qwen3-VL-235B and 18.9x faster than Qwen3-VL-235B-Thinking, making it highly efficient for real-time applications.
Mitigates Text Understanding Gap: Addresses the primary limitation of small VLMs, their weaker understanding of complex prompts, by generating a larger number of mid-quality tokens to match the performance of larger models.
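
The IoU (intersection over union) figures above measure how closely a predicted bounding box overlaps the reference box. The sketch below shows the standard computation for a single pair of axis-aligned boxes. The `(x1, y1, x2, y2)` corner format and the function name are assumptions for illustration; how the reported benchmark averages are aggregated is not specified here.

```python
def box_iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes.

    Each box is (x1, y1, x2, y2) with x1 <= x2 and y1 <= y2
    (an assumed format; the model's actual output format may differ).
    """
    # Corners of the overlapping rectangle, if there is one.
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])

    # Clamp at zero so disjoint boxes have no intersection area.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter

    return inter / union if union > 0 else 0.0


# Two 2x2 boxes overlapping in a 1x1 square: IoU = 1 / (4 + 4 - 1).
score = box_iou((0, 0, 2, 2), (1, 1, 3, 3))
```

An IoU of 1.0 means the prediction matches the reference exactly, and 0.0 means the boxes do not overlap at all.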

Ideal Use Cases

Precise Object Localization: Applications requiring accurate bounding box predictions based on natural language descriptions.
Efficient VLM Deployment: Scenarios where high performance on visual grounding is needed without the computational overhead of very large models.
Research in Visual Grounding: A strong baseline and advancement for exploring efficient VLM architectures and training methodologies.
