godx7/PixelEyes-8B
PixelEyes-8B is an 8-billion-parameter multimodal large language model (MLLM) developed by Dengxian Gong et al. that specializes in active visual search. It decouples perception from reasoning by delegating fine-grained localization to a specialized perception tool, which enables efficient and accurate multi-turn visual reasoning. The model is designed for tasks that require pinpoint visual evidence seeking, that is, precisely locating the information in an image that supports an answer.
PixelEyes-8B: Decoupling Perception and Reasoning
PixelEyes-8B is an 8-billion-parameter multimodal large language model (MLLM) developed by Dengxian Gong and collaborators. It approaches visual reasoning by decoupling perception from reasoning, and it is designed specifically for pinpoint visual evidence seeking.
Key Capabilities
- Enhanced Active Visual Search: PixelEyes improves how MLLMs actively search an image for the visual evidence a task requires.
- Fine-Grained Localization: It delegates the task of precise visual localization to a specialized perception tool.
- Efficient Multi-Turn Visual Reasoning: By separating perception from reasoning, the model achieves more efficient and accurate reasoning over multiple turns of interaction.
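The division of labor in the list above can be illustrated with a toy control loop. This is only a minimal sketch under stated assumptions, not the PixelEyes implementation: `visual_search`, `Step`, `toy_reason`, and `toy_locate` are hypothetical names invented here, and the "perception tool" is a dictionary lookup standing in for a real localization model. It shows a reasoner that only names what it wants located, while a separate callable returns the coordinates.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# Hypothetical types: none of these names come from the PixelEyes code base.
Box = tuple[int, int, int, int]            # (left, top, right, bottom) in pixels
Evidence = list[tuple[str, Optional[Box]]]  # (target description, box or None)


@dataclass
class Step:
    """One reasoning turn: either a localization request or a final answer."""
    query: Optional[str] = None   # what the reasoner wants located
    answer: Optional[str] = None  # set once the reasoner is done


def visual_search(
    question: str,
    reason: Callable[[str, Evidence], Step],
    locate: Callable[[str], Optional[Box]],
    max_turns: int = 4,
) -> tuple[Optional[str], Evidence]:
    """Alternate between a reasoner and a perception tool until an answer appears."""
    evidence: Evidence = []
    for _ in range(max_turns):
        step = reason(question, evidence)
        if step.answer is not None:
            return step.answer, evidence
        # The reasoner never predicts coordinates itself; it only names a target,
        # and the perception tool is responsible for the fine-grained box.
        evidence.append((step.query, locate(step.query)))
    return None, evidence  # turn budget exhausted without an answer


# Toy stand-ins so the sketch runs without any model weights.
def toy_locate(query: str) -> Optional[Box]:
    known = {"street sign": (412, 80, 468, 120)}
    return known.get(query)


def toy_reason(question: str, evidence: Evidence) -> Step:
    if not evidence:
        return Step(query="street sign")
    target, box = evidence[-1]
    if box is None:
        return Step(answer="not found")
    return Step(answer=f"{target} located at {box}")


answer, evidence = visual_search(
    "What does the street sign say?", toy_reason, toy_locate
)
print(answer)
```

In a real system the two callables would wrap the language model and the perception tool respectively. Keeping them behind plain function interfaces is one way to let either side be swapped or tested in isolation.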
What Makes It Different
Unlike MLLMs that handle perception and reasoning within a single tightly integrated model, PixelEyes assigns visual localization to a dedicated component, freeing the language model to focus on higher-level reasoning. This specialization is particularly beneficial for applications that require precise identification and referencing of visual elements within an image. The model's development is detailed in the paper "PixelEyes: Decoupling Perception and Reasoning for Pinpoint Visual Evidence Seeking", which is accompanied by the Pinpoint-Bench dataset for evaluation.