nvidia/EGM-8B-SFT
nvidia/EGM-8B-SFT is an 8-billion-parameter supervised fine-tuning (SFT) checkpoint from NVIDIA's Efficient Visual Grounding Language Models (EGM) project, built on Qwen3-VL-8B-Thinking. This intermediate model targets visual grounding tasks and learns structured reasoning from proprietary VLM-generated chain-of-thought data. It serves as the initialization point for a subsequent reinforcement learning (RL) stage, which produces the final visual grounding model.
Overview
nvidia/EGM-8B-SFT is an 8-billion-parameter supervised fine-tuning (SFT) checkpoint developed by NVIDIA as part of its Efficient Visual Grounding Language Models (EGM) initiative. It is based on Qwen3-VL-8B-Thinking and has a context length of 32,768 tokens. It is an intermediate checkpoint intended as the starting point for reinforcement learning (RL) training; the final, best-performing model is available as nvidia/EGM-8B.
Key Capabilities & Training
- Visual Grounding: Specifically fine-tuned for visual grounding tasks, learning to associate language with visual elements.
- Reasoning-Augmented Data: Trained on proprietary VLM-generated chain-of-thought reasoning steps to enable structured visual grounding with explicit reasoning.
- SFT Stage: This checkpoint is the output of the supervised fine-tuning stage and the starting point for the subsequent GRPO (RL) stage.
- Architecture: Uses the Qwen3VLForConditionalGeneration architecture in bfloat16 precision, with 36 text layers and 27 vision layers.
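The architecture details above suggest how the checkpoint might be loaded. The following is a minimal sketch, not an official recipe: it assumes a Hugging Face transformers release that ships Qwen3VLForConditionalGeneration, an installed PyTorch, and enough memory for an 8B model in bfloat16. The helper name load_egm_sft and the EXPECTED dictionary are hypothetical and exist only for illustration.

```python
MODEL_ID = "nvidia/EGM-8B-SFT"

# Values quoted from the model card, kept here for reference only.
EXPECTED = {
    "architecture": "Qwen3VLForConditionalGeneration",
    "dtype": "bfloat16",
    "text_layers": 36,
    "vision_layers": 27,
}


def load_egm_sft(model_id: str = MODEL_ID):
    """Hypothetical helper: load the SFT checkpoint and its processor.

    Heavy imports stay inside the function so that importing this module
    does not require torch or transformers to be installed.
    """
    import torch
    from transformers import AutoProcessor, Qwen3VLForConditionalGeneration

    # Assumption: the checkpoint loads with the stock Qwen3-VL class named
    # in the model card, using the bfloat16 precision it reports.
    model = Qwen3VLForConditionalGeneration.from_pretrained(
        model_id, dtype=torch.bfloat16
    )
    processor = AutoProcessor.from_pretrained(model_id)
    return model, processor
```

Calling load_egm_sft() downloads the weights on first use. Because this is the SFT checkpoint rather than the final model, nvidia/EGM-8B is likely the better choice for direct inference.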
Intended Use
- RL Training Initialization: Primarily designed as an initialization checkpoint for developers to conduct their own reinforcement learning training using the EGM framework.
- Research & Development: Suitable for researchers exploring visual grounding, multi-modal models, and RL-based fine-tuning techniques.
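For readers unfamiliar with the GRPO stage that this checkpoint initializes, the core idea is to score each sampled response relative to the other responses drawn for the same prompt. The sketch below illustrates that group-relative normalization in general terms. It is not taken from the EGM framework: the function name, the use of the population standard deviation, and the eps value are all assumptions, and the reward used for visual grounding is not specified in this card.

```python
from statistics import fmean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of responses to the same prompt.

    Each advantage is (reward - group mean) / (group std + eps).
    Assumption: population standard deviation; implementations may differ.
    """
    if not rewards:
        return []
    mean = fmean(rewards)
    std = pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Illustrative rewards for four sampled responses to one prompt.
advantages = group_relative_advantages([1.0, 0.0, 0.5, 0.5])
```

Responses that score above the group mean receive positive advantages and those below receive negative ones. When every response in a group earns the same reward, all advantages are zero and that group contributes no learning signal.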