Name: PeiyangLiu/CoE-SlideVQA-8B API
Brand: Featherless.ai
Price: 10.00 USD
Availability: InStock
Author: PeiyangLiu

CoE-SlideVQA-8B: Vision-Language Model for Slide-Based QA

CoE-SlideVQA-8B is an 8-billion-parameter vision-language checkpoint developed by PeiyangLiu and fine-tuned for Chain-of-Evidence (CoE) question answering over presentation slide screenshots. Given a natural-language question and a set of slide images, the model selects the relevant visual evidence from the slides and formulates an answer from it.

Key Capabilities

Visual Question Answering (VQA) on Slides: Answers questions by analyzing content within presentation slides.
Evidence Selection: Identifies and localizes specific visual evidence within slide screenshots that supports the generated answer.
Grounded Multimodal Reasoning: Connects textual questions with visual information to provide contextually relevant responses.
Structured Output: Produces a JSON-style response including an evidence_chain (selected supporting slides and localized evidence) and the answer.
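
The structured output described above can be consumed with a small parser. The following is a minimal sketch in Python, not the project's own code. It assumes the model returns a JSON object with the two documented keys, evidence_chain and answer, possibly surrounded by extra text. The SlideAnswer class, the parse_response function, and the per-item fields in the usage example ("slide", "bbox") are hypothetical names chosen for illustration; the actual schema is defined by the project code.

```python
import json
from dataclasses import dataclass
from typing import Any


@dataclass
class SlideAnswer:
    """Hypothetical container for the two documented output keys."""
    evidence_chain: list[Any]  # item layout is not specified here; kept opaque
    answer: str


def parse_response(raw: str) -> SlideAnswer:
    """Decode the JSON object starting at the first '{' in raw model text.

    Raises ValueError if no object is found, the JSON is malformed,
    or either documented key is missing or mistyped.
    """
    start = raw.find("{")
    if start == -1:
        raise ValueError("no JSON object found in model output")

    # raw_decode parses one JSON value and ignores any trailing text.
    obj, _ = json.JSONDecoder().raw_decode(raw[start:])

    if not isinstance(obj, dict):
        raise ValueError("expected a JSON object")
    for key in ("evidence_chain", "answer"):
        if key not in obj:
            raise ValueError(f"missing key: {key}")
    if not isinstance(obj["evidence_chain"], list):
        raise ValueError("evidence_chain must be a list")

    return SlideAnswer(evidence_chain=obj["evidence_chain"], answer=str(obj["answer"]))


if __name__ == "__main__":
    # Made-up sample output; the item fields are assumptions, not the real schema.
    sample = 'Result: {"evidence_chain": [{"slide": 2, "bbox": [0, 0, 10, 10]}], "answer": "42%"}'
    parsed = parse_response(sample)
    print(parsed.answer, len(parsed.evidence_chain))
```

Keeping the evidence_chain items opaque avoids hard-coding a layout that may differ from what the checkpoint actually emits; tighten the validation once the real format has been confirmed against the project code.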

Good For

Research and Prototyping: Ideal for exploring slide-based visual QA, evidence selection, and multimodal reasoning tasks.
Analyzing Presentations: Can be used to extract information or answer specific queries directly from presentation visuals.
Developing Intelligent Assistants: Can serve as a component in systems that require visual understanding of presentation content.

For the exact prompt format and evaluation procedure, refer to the project code. The model was trained on the Wiki-CoE dataset.
