Name: UCSC-VLAA/VLM-CapCurriculum-Qwen3-VL-8B-Staged API
Brand: Featherless.ai
Price: 10.00 USD
Availability: InStock
Author: UCSC-VLAA

VLM-CapCurriculum-Qwen3-VL-8B-Staged: Enhanced Visual Reasoning

This model is an 8-billion-parameter vision-language model (VLM) developed by UCSC-VLAA and post-trained from the Qwen3-VL-8B-Instruct backbone. Its core innovation is a staged, capability-dimension curriculum that trains visual perception, textual reasoning, and visual reasoning in sequence. This approach, detailed in the ICML 2026 paper "From Seeing to Thinking: Decoupling Perception and Reasoning Improves Post-Training of Vision-Language Models," targets visual perception as a bottleneck in VLMs.

Key Capabilities & Performance

Improved Visual Reasoning: Scores higher on visual math benchmarks (e.g., MathVista, MathVision) and perception benchmarks (e.g., A-OKVQA, MMStar) than both the base Qwen3-VL-8B and merged training methods.
Efficient Reasoning: Produces 20.8% shorter reasoning traces, suggesting that better perceptual understanding reduces the need for extensive textual reasoning.
Staged Training: Trained in three distinct stages of reinforcement learning with verifiable rewards (RLVR): visual perception, textual reasoning, and visual reasoning, each with its own specialized dataset.

Intended Use & Limitations

Research Focus: Primarily designed for research in vision-language reasoning, post-training methodologies, and curriculum learning.
Inherited Profile: Inherits the safety and bias profile of the underlying Qwen3-VL-8B-Instruct backbone, without additional alignment fine-tuning.
Context & Prompt: Trained with a 2048-token maximum prompt length and a fixed group size of 5. Behavior with much longer contexts or significantly different prompt formats is not characterized.

Inference Format

The model expects a unified system prompt requiring an internal monologue enclosed in <think> </think> tags, followed by the final answer in \boxed{}.
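
As a minimal sketch of handling this output format on the client side, the Python below splits a response into its reasoning trace and final answer. The wording of SYSTEM_PROMPT is an assumption made for illustration (this page describes the format but does not quote the exact prompt), and the helper names extract_boxed and parse_response are hypothetical, not part of any official API.

```python
import re

# Assumed wording: the exact system prompt is not quoted on this page.
SYSTEM_PROMPT = (
    "Reason step by step inside <think> </think> tags, "
    "then give the final answer in \\boxed{}."
)

# Non-greedy match so only the first think block is captured.
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)
BOXED_OPEN = "\\boxed{"


def extract_boxed(text):
    # Return the contents of the last boxed expression in text, or None.
    # Braces are counted so nested LaTeX such as a fraction stays intact.
    start = text.rfind(BOXED_OPEN)
    if start == -1:
        return None
    depth = 1
    chars = []
    for ch in text[start + len(BOXED_OPEN):]:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(chars)
        chars.append(ch)
    return None  # braces never balanced (e.g. truncated output)


def parse_response(text):
    # Split a raw response into its reasoning trace and final answer.
    match = THINK_RE.search(text)
    reasoning = match.group(1).strip() if match else None
    # Search for the answer only after the reasoning block, when one exists.
    tail = text[match.end():] if match else text
    return {"reasoning": reasoning, "answer": extract_boxed(tail)}


if __name__ == "__main__":
    sample = "<think>Half of one is one half.</think> Answer: \\boxed{\\frac{1}{2}}"
    print(parse_response(sample))
```

Returning None for a missing or unbalanced \boxed{} lets the caller distinguish a truncated generation from a well-formed answer instead of silently accepting partial text.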
