prism-vlm/Qwen3-VL-8B-Instruct-SFT-PRISM-GRPO

VISION | Concurrency Cost: 1 | Model Size: 8B | Quant: FP8 | Ctx Length: 32k | Published: Apr 26, 2026 | Architecture: Transformer | Cold

Qwen3-VL-8B-Instruct-SFT-PRISM-GRPO is an 8 billion parameter large multimodal model (LMM) developed by Sudong Wang et al. It is based on Qwen3-VL-8B-Instruct and further optimized using the PRISM three-stage post-training pipeline, which includes a pre-alignment stage via black-box adversarial on-policy distillation. This pipeline provides corrective signals for visual grounding and reasoning consistency, addressing distributional drift issues common in SFT-to-RL pipelines. The model is designed for enhanced multimodal reasoning capabilities, particularly in scenarios where visual perception and logical reasoning errors need explicit correction.

Overview of Qwen3-VL-8B-Instruct-SFT-PRISM-GRPO

Qwen3-VL-8B-Instruct-SFT-PRISM-GRPO is an 8-billion-parameter large multimodal model (LMM) built on Qwen3-VL-8B-Instruct. It is distinguished by PRISM, a three-stage post-training pipeline designed to mitigate the distributional drift that often arises when Supervised Fine-Tuning (SFT) is followed directly by Reinforcement Learning with Verifiable Rewards (RLVR).

Key Capabilities and Innovations

  • PRISM Pipeline: Introduces an explicit pre-alignment stage between SFT and RLVR. This stage utilizes black-box adversarial on-policy distillation with a Mixture-of-Experts discriminator.
  • Corrective Signals: Provides separate corrective signals specifically for visual grounding and reasoning consistency, addressing distinct error types in multimodal reasoning.
  • Enhanced Multimodal Reasoning: Aims to improve the model's ability to accurately perceive visual information and perform logical reasoning by explicitly correcting drift issues.
  • GRPO Integration: The PRISM pipeline culminates with GRPO (Group Relative Policy Optimization) as the final reinforcement learning algorithm.
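
The group-relative idea behind the final GRPO stage can be illustrated with a short sketch. This is not the PRISM training code: the function name `group_relative_advantages`, the `eps` constant, and the example rewards are all hypothetical, and the population standard deviation is an assumption (implementations differ on this detail). The sketch only shows the commonly described step in which several responses are sampled for one prompt and each response's reward is normalized against the statistics of its own group.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against the mean and std of its own group.

    `rewards` holds one scalar reward per sampled response to the same prompt.
    `eps` (an assumed constant) avoids division by zero when all rewards match.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption for this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical verifiable rewards (1.0 = correct answer, 0.0 = incorrect)
# for four responses sampled for a single prompt.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)

for reward, advantage in zip(rewards, advantages):
    print(f"reward={reward:.1f} advantage={advantage:+.3f}")
```

Responses that score above their group's mean receive a positive advantage and those below receive a negative one, so no separate learned value model is needed to provide a baseline. When every response in a group earns the same reward, all advantages are zero and that group contributes no learning signal.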

Why This Model is Different

Traditional SFT-to-RL pipelines can suffer from distributional drift: after fine-tuning, the model may lose some of its original capabilities or fail to match the supervision distribution. PRISM addresses this by inserting a pre-alignment phase between SFT and RL that uses adversarial distillation, with the aim of better preserving existing capabilities and aligning the model more faithfully with multimodal reasoning requirements. The model is therefore intended for applications that demand accuracy in both visual perception and logical inference.