nvidia/EGM-8B-SFT

VISIONConcurrency Cost:1Model Size:8BQuant:FP8Ctx Length:32kPublished:Apr 2, 2026License:apache-2.0Architecture:Transformer0.0K Open Weights Cold

nvidia/EGM-8B-SFT is an 8 billion parameter supervised fine-tuning (SFT) checkpoint from NVIDIA's Efficient Visual Grounding Language Models (EGM) project, built upon Qwen3-VL-8B-Thinking. This intermediate model is specifically designed for visual grounding tasks, learning structured reasoning from proprietary VLM-generated chain-of-thought data. It serves as an initialization point for further reinforcement learning (RL) training to achieve optimal performance in visual grounding applications.

Loading preview...

Overview

nvidia/EGM-8B-SFT is an 8 billion parameter supervised fine-tuning (SFT) checkpoint developed by NVIDIA as part of their Efficient Visual Grounding Language Models (EGM) initiative. It is based on the Qwen3-VL-8B-Thinking architecture and features a 32,768 token context length. This model is an intermediate checkpoint intended for further reinforcement learning (RL) training, with the final, best-performing model available as nvidia/EGM-8B.

Key Capabilities & Training

  • Visual Grounding: Specifically fine-tuned for visual grounding tasks, learning to associate language with visual elements.
  • Reasoning-Augmented Data: Trained on proprietary VLM-generated chain-of-thought reasoning steps to enable structured visual grounding with explicit reasoning.
  • SFT Stage: Represents the supervised fine-tuning stage, serving as the foundational model before the subsequent GRPO (RL) stage.
  • Architecture: Utilizes a Qwen3VLForConditionalGeneration architecture with bfloat16 precision, 36 text layers, and 27 vision layers.

Intended Use

  • RL Training Initialization: Primarily designed as an initialization checkpoint for developers to conduct their own reinforcement learning training using the EGM framework.
  • Research & Development: Suitable for researchers exploring visual grounding, multi-modal models, and RL-based fine-tuning techniques.