inclusionAI/ZwZ-8B

  • Modality: Vision
  • Model Size: 8B
  • Quantization: FP8
  • Context Length: 32k
  • Concurrency Cost: 1
  • Published: Feb 12, 2026
  • License: apache-2.0
  • Architecture: Transformer
  • Weights: Open

inclusionAI/ZwZ-8B is an 8-billion-parameter fine-grained multimodal perception model built upon Qwen3-VL-8B. It is trained with Region-to-Image Distillation (R2I) and reinforcement learning to deliver fine-grained visual understanding in a single forward pass, removing the need for inference-time zooming or tool calling. The model leads fine-grained perception benchmarks among open-source models of comparable size and generalizes well out of distribution to visual reasoning, GUI agent, and AIGC detection tasks.


Overview

ZwZ-8B is an 8-billion-parameter multimodal perception model developed by inclusionAI, based on the Qwen3-VL-8B architecture. Its core innovation is the training methodology, which combines Region-to-Image Distillation (R2I) with reinforcement learning. This enables the model to perform fine-grained visual understanding in a single forward pass, bypassing the inference-time zooming or external tool calls that typically add latency.
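
Because ZwZ-8B is built on Qwen3-VL-8B, a natural way to query it is through the standard transformers image-text-to-text interface. The snippet below is a minimal sketch rather than an official usage example: the `AutoModelForImageTextToText`/`AutoProcessor` classes, the chat-message format, and the local image path are assumptions, and the model repository may ship its own recommended loading code.

```python
# Minimal single-pass inference sketch (assumed interface; see lead-in above).
import torch
from PIL import Image
from transformers import AutoModelForImageTextToText, AutoProcessor

model_id = "inclusionAI/ZwZ-8B"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

image = Image.open("street_scene.jpg")  # hypothetical local image
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "What is printed on the small sign next to the blue door?"},
        ],
    }
]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)

# One forward/generate pass: no zoom-in tool calls or visual re-encoding loops.
inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=128)
print(processor.decode(output_ids[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```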

Key Capabilities

  • Single-Pass Efficiency: Achieves detailed visual perception without the overhead of repeated tool calls or visual re-encoding during inference.
  • Superior Fine-Grained Accuracy: Demonstrates state-of-the-art performance on various perception benchmarks when compared to other open-source models of similar scale.
  • Broad Generalization: Shows strong out-of-distribution generalization across diverse tasks, including visual reasoning, GUI agent interactions, and AIGC (AI-generated content) detection.

How It Works

ZwZ-8B transforms the concept of "zooming" from an inference-time operation into a training-time primitive. This involves three steps (a small illustrative sketch of the first two follows the list):

  1. Zooming into micro-cropped regions and leveraging powerful teacher models (like Qwen3-VL-235B, GLM-4.5V) to generate high-quality VQA (Visual Question Answering) data.
  2. Distilling this region-grounded supervision back to the full image, incorporating explicit bounding-box overlays.
  3. Reinforcing this learning through RL training to enable single-glance fine-grained perception.
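
The actual training pipeline is not released with this card, but the region-to-image idea can be illustrated with a short sketch. Everything below is hypothetical: `build_r2i_sample`, the `ask_teacher` callable standing in for a large teacher VLM (e.g. Qwen3-VL-235B), the crop coordinates, and the overlay style are placeholders chosen to mirror steps 1-2 above, not the authors' actual code.

```python
# Hypothetical sketch of Region-to-Image (R2I) data generation.
from PIL import Image, ImageDraw

def build_r2i_sample(image_path, box, ask_teacher):
    """box = (left, top, right, bottom) of a small region of interest."""
    full_image = Image.open(image_path).convert("RGB")

    # 1. "Zoom" at training time: crop the micro-region and let the teacher
    #    produce a question/answer grounded in that crop.
    region = full_image.crop(box)
    question, answer = ask_teacher(region)

    # 2. Distill back to the full image: the student trains on the *uncropped*
    #    image, with the region marked by an explicit bounding-box overlay so
    #    the supervision stays region-grounded.
    overlaid = full_image.copy()
    ImageDraw.Draw(overlaid).rectangle(box, outline="red", width=3)

    return {"image": overlaid, "question": question, "answer": answer}
```

Step 3 would then use RL to reward the student for answering such questions correctly from the full image alone, so that at inference time no crop or tool call is needed.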

Good For

  • Applications requiring efficient and accurate fine-grained visual analysis.
  • Tasks demanding real-time multimodal perception where latency is a concern.
  • Use cases involving visual reasoning, GUI automation, or detecting AI-generated content.