inclusionAI/ZwZ-8B
inclusionAI/ZwZ-8B is an 8-billion-parameter fine-grained multimodal perception model built upon Qwen3-VL-8B. It is trained with Region-to-Image Distillation (R2I) and reinforcement learning to achieve fine-grained visual understanding in a single forward pass, eliminating inference-time zooming and tool calling. The model leads fine-grained perception benchmarks among open-source models of comparable size and demonstrates strong out-of-distribution generalization on visual reasoning, GUI agent, and AIGC detection tasks.
Overview
ZwZ-8B is an 8 billion parameter multimodal perception model developed by inclusionAI, based on the Qwen3-VL-8B architecture. Its core innovation lies in its training methodology, which utilizes Region-to-Image Distillation (R2I) combined with reinforcement learning. This approach enables the model to perform fine-grained visual understanding in a single forward pass, bypassing the need for traditional inference-time zooming or external tool calls, which often introduce latency.
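Since ZwZ-8B is built on Qwen3-VL-8B, it should follow the standard Hugging Face image-text-to-text interface. The sketch below is illustrative and unverified against the repository: the `AutoModelForImageTextToText`/`AutoProcessor` classes are the generic Transformers entry points, the chat-message layout follows the Qwen-VL convention, and the image URL and question are placeholders; check the model repo for the exact recommended loading code.

```python
# Hypothetical usage sketch for inclusionAI/ZwZ-8B, assuming the standard
# Qwen3-VL-style chat interface in Hugging Face Transformers. Verify the
# exact classes and prompt format against the model repository.

def build_messages(image_url: str, question: str) -> list:
    """Assemble a Qwen-VL-style chat message with one image and one question."""
    return [{
        "role": "user",
        "content": [
            {"type": "image", "url": image_url},
            {"type": "text", "text": question},
        ],
    }]

if __name__ == "__main__":
    # Heavy imports and the model download happen only when run as a script.
    from transformers import AutoModelForImageTextToText, AutoProcessor

    model = AutoModelForImageTextToText.from_pretrained(
        "inclusionAI/ZwZ-8B", torch_dtype="auto", device_map="auto")
    processor = AutoProcessor.from_pretrained("inclusionAI/ZwZ-8B")

    messages = build_messages(
        "https://example.com/receipt.jpg",          # placeholder image
        "What is the total amount on this receipt?")
    inputs = processor.apply_chat_template(
        messages, add_generation_prompt=True, tokenize=True,
        return_dict=True, return_tensors="pt").to(model.device)

    # One forward pass: no inference-time zooming or tool calls needed.
    out = model.generate(**inputs, max_new_tokens=128)
    print(processor.batch_decode(
        out[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True)[0])
```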
Key Capabilities
- Single-Pass Efficiency: Achieves detailed visual perception without the overhead of repeated tool calls or visual re-encoding during inference.
- Superior Fine-Grained Accuracy: Achieves leading results on fine-grained perception benchmarks among open-source models of similar scale.
- Broad Generalization: Shows strong out-of-distribution generalization across diverse tasks, including visual reasoning, GUI agent interactions, and AIGC (AI-generated content) detection.
How It Works
ZwZ-8B transforms the concept of "zooming" from an inference-time operation into a training-time primitive. This involves:
- Zooming into micro-cropped regions and using strong teacher models (e.g., Qwen3-VL-235B, GLM-4.5V) to generate high-quality region-level VQA (Visual Question Answering) data.
- Distilling this region-grounded supervision back to the full image, incorporating explicit bounding-box overlays.
- Reinforcing this learning through RL training to enable single-glance fine-grained perception.
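The data-construction steps above can be sketched as a small pipeline. This is an illustrative mock-up of the R2I idea, not code from the ZwZ training stack: the function names, the red-box overlay style, and the record schema are assumptions, and the teacher-model call is stubbed with a placeholder string.

```python
# Illustrative sketch of Region-to-Image Distillation (R2I) data construction:
# zoom into a micro-cropped region for the teacher, then distill the
# region-grounded QA back to the full image via an explicit bounding-box
# overlay. All names and the record format are hypothetical; the teacher
# call is stubbed.
from PIL import Image, ImageDraw

def crop_region(image: Image.Image, box: tuple) -> Image.Image:
    """Zoom: extract the micro-region the teacher model will be asked about."""
    return image.crop(box)

def overlay_box(image: Image.Image, box: tuple) -> Image.Image:
    """Distill back: mark the region explicitly on the full image."""
    out = image.copy()
    ImageDraw.Draw(out).rectangle(box, outline=(255, 0, 0), width=3)
    return out

def build_r2i_record(image: Image.Image, box: tuple,
                     question: str, teacher_answer: str) -> dict:
    """Pair the full image (with bbox overlay) with the region-level QA so the
    student learns to answer in one glance, without cropping at inference."""
    return {
        "image": overlay_box(image, box),
        "question": f"{question} (region marked in red)",
        "answer": teacher_answer,
    }

if __name__ == "__main__":
    full = Image.new("RGB", (640, 480), "white")   # stand-in training image
    box = (100, 120, 220, 200)                      # region of interest
    region = crop_region(full, box)                 # sent to the teacher model
    record = build_r2i_record(full, box,
                              "What text appears in the region?",
                              "stubbed teacher answer")
    print(region.size, record["image"].size)        # region vs. full image
```

The RL stage described in the last bullet would then reward the student for answering such questions correctly from the full image alone, which is what removes the need for zooming at inference time.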
Good For
- Applications requiring efficient and accurate fine-grained visual analysis.
- Latency-sensitive tasks that demand real-time multimodal perception.
- Use cases involving visual reasoning, GUI automation, or detecting AI-generated content.