Overview of Qwen3-VL-8B-Instruct-SFT-PRISM-GRPO

Qwen3-VL-8B-Instruct-SFT-PRISM-GRPO is an 8-billion-parameter large multimodal model (LMM) built on Qwen3-VL-8B-Instruct. It is trained with PRISM, a three-stage post-training pipeline designed to mitigate the distributional drift that often arises when Supervised Fine-Tuning (SFT) is followed directly by Reinforcement Learning with Verifiable Rewards (RLVR).

Key Capabilities and Innovations

PRISM Pipeline: Inserts an explicit pre-alignment stage between SFT and RLVR. This stage uses black-box adversarial on-policy distillation with a Mixture-of-Experts discriminator.
Corrective Signals: Provides separate corrective signals for visual grounding and for reasoning consistency, so that perception errors and reasoning errors are addressed individually.
Enhanced Multimodal Reasoning: Aims to improve how accurately the model perceives visual information and reasons over it by explicitly correcting drift.
GRPO Integration: The PRISM pipeline ends with GRPO (Group Relative Policy Optimization) as the reinforcement learning algorithm for the final stage.
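
To make the final stage more concrete, here is a minimal sketch of the two ideas it combines: a verifiable reward and GRPO's group-relative advantage. GRPO scores each sampled response relative to the other responses drawn for the same prompt, typically by subtracting the group mean reward and dividing by the group standard deviation. This is not the PRISM training code. The function names, the exact-match reward rule, the use of population standard deviation, and the `eps` value are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def verifiable_reward(predicted: str, reference: str) -> float:
    """Hypothetical rule-based reward: 1.0 on a case-insensitive exact match, else 0.0."""
    return 1.0 if predicted.strip().lower() == reference.strip().lower() else 0.0


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against the mean and std of its own sampling group.

    eps (an assumed value) avoids division by zero when every reward is identical.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled answers to one prompt whose reference answer is "42" (made-up data).
samples = ["42", "41", " 42 ", "unknown"]
rewards = [verifiable_reward(s, "42") for s in samples]
advantages = group_relative_advantages(rewards)
```

Correct answers receive a positive advantage and incorrect ones a negative advantage. When every response in a group earns the same reward, all advantages are zero and that group contributes no learning signal.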

Why This Model is Different

Traditional SFT-to-RL pipelines can suffer from distributional drift: after fine-tuning, the model may lose some of its original capabilities or fail to match the supervision distribution. PRISM addresses this by inserting a pre-alignment phase that uses adversarial distillation, with the aim of preserving those capabilities and aligning the model more faithfully with multimodal reasoning requirements before reinforcement learning begins. The model is therefore aimed at applications that need accuracy in both visual perception and logical inference.
