PixelEyes-8B: Decoupling Perception and Reasoning

PixelEyes-8B is an 8-billion-parameter multimodal large language model (MLLM) developed by Dengxian Gong and collaborators. The model decouples perception from reasoning, an approach to visual reasoning designed specifically for pinpoint visual evidence seeking.

Key Capabilities

Enhanced Active Visual Search: PixelEyes improves how MLLMs perform active visual search.
Fine-Grained Localization: It delegates precise visual localization to a specialized perception tool.
Efficient Multi-Turn Visual Reasoning: By separating perception from reasoning, the model achieves more efficient and accurate reasoning over multiple turns of interaction.

What Makes It Different

Unlike traditional MLLMs, which tend to integrate perception and reasoning tightly, PixelEyes uses a dedicated component for visual localization, freeing the language model to focus on higher-level reasoning. This specialization is particularly beneficial for applications that require precise identification and referencing of visual elements within an image. The model is described in the paper "PixelEyes: Decoupling Perception and Reasoning for Pinpoint Visual Evidence Seeking" and is accompanied by the Pinpoint-Bench dataset for evaluation.
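
The decoupling described above can be illustrated with a small control loop. This is only a minimal sketch under stated assumptions, not the PixelEyes implementation or its API: the names `Step`, `seek_evidence`, `toy_reason`, and `toy_locate` are hypothetical, and the two toy functions stand in for the language model and the perception tool. It shows one plausible shape for multi-turn evidence seeking, in which the reasoner either asks for a region to be located or returns an answer, and the localization itself happens outside the reasoner.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

Box = Tuple[int, int, int, int]   # (x0, y0, x1, y1) in pixels
Evidence = List[Tuple[str, Box]]  # (query, located box) pairs


@dataclass
class Step:
    """One reasoning turn: either a localization request or a final answer."""
    query: Optional[str] = None
    answer: Optional[str] = None


def seek_evidence(
    question: str,
    reason: Callable[[str, Evidence], Step],  # hypothetical reasoner
    locate: Callable[[str], Box],             # hypothetical perception tool
    max_turns: int = 4,
) -> Tuple[Optional[str], Evidence]:
    """Alternate reasoning and localization until an answer is produced."""
    evidence: Evidence = []
    for _ in range(max_turns):
        step = reason(question, evidence)
        if step.answer is not None:
            return step.answer, evidence
        # The reasoner never predicts coordinates; it delegates localization.
        evidence.append((step.query, locate(step.query)))
    return None, evidence  # turn budget exhausted without an answer


# Toy stand-ins (assumptions for illustration only).
def toy_locate(query: str) -> Box:
    return {"street sign": (40, 10, 120, 60)}.get(query, (0, 0, 0, 0))


def toy_reason(question: str, evidence: Evidence) -> Step:
    if not evidence:
        return Step(query="street sign")
    return Step(answer=f"evidence at {evidence[-1][1]}")


answer, evidence = seek_evidence("What does the sign say?", toy_reason, toy_locate)
print(answer, evidence)
```

In this sketch the reasoner only sees the question and the evidence gathered so far, so the perception tool could be swapped out without changing the reasoning loop.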
