nvidia/EGM-4B-SFT
nvidia/EGM-4B-SFT is a 4-billion-parameter supervised fine-tuning (SFT) checkpoint from the Efficient Visual Grounding Language Models (EGM) training pipeline, built on Qwen3-VL-4B-Thinking. Developed by NVIDIA, it is an intermediate checkpoint intended for further reinforcement learning (RL) training to reach optimal visual grounding performance. It was fine-tuned on reasoning-augmented data generated by a proprietary VLM to learn structured visual grounding with explicit reasoning.
Overview
nvidia/EGM-4B-SFT is a 4-billion-parameter supervised fine-tuning (SFT) checkpoint from NVIDIA's Efficient Visual Grounding Language Models (EGM) project. It is an intermediate checkpoint derived from Qwen3-VL-4B-Thinking and is specifically designed to serve as the initialization for subsequent reinforcement learning (RL) training. Its primary purpose is to learn structured visual grounding with explicit reasoning.
Key Capabilities & Training
This SFT model is produced by fine-tuning the base Qwen3-VL-4B-Thinking model on reasoning-augmented visual grounding data. The data includes detailed chain-of-thought reasoning steps generated by a proprietary vision-language model (VLM), enabling the model to develop explicit reasoning capabilities for visual grounding tasks. The architecture is Qwen3VLForConditionalGeneration, and the weights are stored in bfloat16 precision.
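A minimal loading and inference sketch with Hugging Face Transformers is shown below, assuming a recent transformers release that ships Qwen3VLForConditionalGeneration and AutoProcessor support for this architecture; the image URL, grounding prompt, and generation settings are illustrative placeholders rather than an officially documented interface.

```python
import torch
from transformers import AutoProcessor, Qwen3VLForConditionalGeneration

model_id = "nvidia/EGM-4B-SFT"

# Load the checkpoint in bfloat16, matching the precision it was trained in.
model = Qwen3VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

# Illustrative grounding request; the exact prompt wording and output format
# (e.g., how bounding boxes are expressed) may differ from what the model expects.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "https://example.com/street_scene.jpg"},
            {"type": "text", "text": "Locate the red traffic light and give its bounding box."},
        ],
    }
]

inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

generated = model.generate(**inputs, max_new_tokens=512)
# Strip the prompt tokens before decoding the model's reasoning and answer.
trimmed = [out[len(inp):] for inp, out in zip(inputs.input_ids, generated)]
print(processor.batch_decode(trimmed, skip_special_tokens=True)[0])
```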
Intended Use
nvidia/EGM-4B-SFT is not intended for direct end-user deployment as a final model. It is provided for developers and researchers who want to perform further reinforcement learning training on top of it. For the best-performing final EGM model, refer to nvidia/EGM-4B, which results from the subsequent RL stage (GRPO, Group Relative Policy Optimization) initialized with this SFT checkpoint.
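For illustration only, the sketch below shows one way such an RL stage could be initialized from this checkpoint using TRL's GRPOTrainer. This is not NVIDIA's actual recipe: the reward function, dataset, and hyperparameters are placeholders, and it assumes a TRL version whose GRPO implementation supports vision-language models.

```python
import torch
from datasets import load_dataset
from transformers import Qwen3VLForConditionalGeneration
from trl import GRPOConfig, GRPOTrainer

# Initialize the RL policy from the SFT checkpoint.
model = Qwen3VLForConditionalGeneration.from_pretrained(
    "nvidia/EGM-4B-SFT", torch_dtype=torch.bfloat16
)

def grounding_reward(completions, **kwargs):
    """Placeholder reward; a real setup might score, e.g., IoU between
    predicted and reference boxes parsed from each completion."""
    return [0.0 for _ in completions]

trainer = GRPOTrainer(
    model=model,
    reward_funcs=grounding_reward,
    args=GRPOConfig(output_dir="egm-4b-grpo", bf16=True),
    # Hypothetical grounding-prompt dataset with a "prompt" column.
    train_dataset=load_dataset("json", data_files="grounding_prompts.jsonl", split="train"),
)
trainer.train()
```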