UCSC-VLAA/VLM-CapCurriculum-Qwen3-VL-8B-Staged
UCSC-VLAA/VLM-CapCurriculum-Qwen3-VL-8B-Staged is an 8-billion-parameter vision-language model post-trained from Qwen/Qwen3-VL-8B-Instruct. It uses a staged, capability-dimension curriculum that decouples perception from reasoning in order to improve visual reasoning. Compared to merged-training baselines, it achieves higher visual math and perception scores while producing shorter reasoning traces. It is primarily intended for research in vision-language reasoning and post-training methodologies.
VLM-CapCurriculum-Qwen3-VL-8B-Staged: Enhanced Visual Reasoning
This model is an 8 billion parameter vision-language model (VLM) developed by UCSC-VLAA, post-trained from the Qwen3-VL-8B-Instruct backbone. Its core innovation lies in a staged, capability-dimension curriculum that sequentially trains visual perception, textual reasoning, and visual reasoning. This approach, detailed in the ICML 2026 paper "From Seeing to Thinking: Decoupling Perception and Reasoning Improves Post-Training of Vision-Language Models," addresses the bottleneck of visual perception in VLMs.
Key Capabilities & Performance
- Improved Visual Reasoning: Achieves higher scores on visual math benchmarks (e.g., MathVista, MathVision) and perception benchmarks (e.g., A-OKVQA, MMStar) compared to the base Qwen3-VL-8B and merged training methods.
- Efficient Reasoning: Produces 20.8% shorter reasoning traces, which suggests that stronger perception reduces the need for lengthy textual reasoning.
- Staged Training: Undergoes three distinct RLVR stages: visual perception, textual reasoning, and visual reasoning, using specialized datasets for each stage.
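The stage ordering above can be sketched as a simple sequential loop. This is only an illustration of the curriculum's shape, not the authors' training code: the `Stage` class, the dataset labels, and the `train_stage` callback are hypothetical placeholders, and it assumes each stage starts from the checkpoint produced by the previous one.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class Stage:
    name: str
    dataset: str  # hypothetical placeholder label, not a real dataset name


# Stage order as listed in this card; dataset labels are placeholders.
CURRICULUM = (
    Stage("visual_perception", "<perception-data>"),
    Stage("textual_reasoning", "<text-reasoning-data>"),
    Stage("visual_reasoning", "<visual-reasoning-data>"),
)


def run_curriculum(
    checkpoint: str,
    stages: Sequence[Stage],
    train_stage: Callable[[str, Stage], str],
) -> str:
    """Run the stages in order, feeding each stage the previous checkpoint."""
    for stage in stages:
        checkpoint = train_stage(checkpoint, stage)
    return checkpoint


def fake_train_stage(checkpoint: str, stage: Stage) -> str:
    """Stand-in for a real RLVR trainer: only records which stages ran."""
    return f"{checkpoint}+{stage.name}"


final_checkpoint = run_curriculum(
    "Qwen/Qwen3-VL-8B-Instruct", CURRICULUM, fake_train_stage
)
print(final_checkpoint)
```

A merged-training baseline would, by contrast, mix all three data sources into a single stage instead of running them one after another.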
Intended Use & Limitations
- Research Focus: Primarily designed for research in vision-language reasoning, post-training methodologies, and curriculum learning.
- Inherited Profile: Inherits the safety and bias profile of the underlying Qwen3-VL-8B-Instruct backbone, without additional alignment fine-tuning.
- Context & Prompt: Trained with a 2048-token maximum prompt length and a fixed group size of 5. Behavior with much longer contexts or significantly different prompt formats is not characterized.
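Because behavior beyond the trained prompt length is not characterized, it may help to flag long prompts before generation. The following is a minimal sketch: the helper name `check_prompt_length` is hypothetical, and it assumes the caller has already tokenized the prompt. Whether image tokens count toward the 2048-token limit is not stated in this card, so the check simply counts whatever token IDs it is given.

```python
import warnings
from typing import Sequence

MAX_TRAINED_PROMPT_TOKENS = 2048  # maximum prompt length stated in this card


def check_prompt_length(
    token_ids: Sequence[int], limit: int = MAX_TRAINED_PROMPT_TOKENS
) -> bool:
    """Return True if the prompt fits the trained length; warn and return False otherwise."""
    n_tokens = len(token_ids)
    if n_tokens > limit:
        warnings.warn(
            f"Prompt has {n_tokens} tokens, above the {limit}-token training "
            "limit; model behavior at this length is not characterized."
        )
        return False
    return True
```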
Inference Format
- The model expects a unified system prompt requiring an internal monologue enclosed in <think> </think> tags, followed by the final answer in \boxed{}.
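Given that output format, a response can be split into its reasoning trace and final answer with a small parser. This is a sketch under assumptions: the function names are hypothetical, the exact system prompt wording is not reproduced here, and the parser assumes at most one <think> block followed by a \boxed{} answer that may contain nested braces.

```python
import re
from typing import Optional, Tuple

BOXED_PREFIX = "\\boxed{"


def extract_boxed(text: str) -> Optional[str]:
    """Return the contents of the last \\boxed{...} in text, or None if absent or unbalanced."""
    start = text.rfind(BOXED_PREFIX)
    if start == -1:
        return None
    depth = 1
    chars = []
    for ch in text[start + len(BOXED_PREFIX):]:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(chars)
        chars.append(ch)
    return None  # braces never closed


def split_response(response: str) -> Tuple[Optional[str], Optional[str]]:
    """Split a model response into (reasoning, final_answer)."""
    match = re.search(r"<think>(.*?)</think>", response, flags=re.DOTALL)
    reasoning = match.group(1).strip() if match else None
    # Look for the answer only after the reasoning block, if one exists.
    tail = response[match.end():] if match else response
    return reasoning, extract_boxed(tail)


example = r"<think>The legs are 3 and 4.</think> The answer is \boxed{\frac{5}{2}}."
reasoning, answer = split_response(example)
print(reasoning)
print(answer)
```

Counting braces rather than using a single regular expression keeps answers such as fractions intact, since a non-greedy pattern would stop at the first closing brace.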