PeiyangLiu/CoE-SlideVQA-8B

VISION | Concurrency Cost: 1 | Model Size: 8B | Quant: FP8 | Ctx Length: 32k | Published: May 5, 2026 | License: apache-2.0 | Architecture: Transformer | Open Weights | Cold

PeiyangLiu/CoE-SlideVQA-8B is an 8 billion parameter vision-language model fine-tuned for Chain-of-Evidence question answering over presentation slide screenshots. Developed by PeiyangLiu, this model is designed to answer natural-language questions by identifying and localizing visual evidence within provided slide images. It specializes in grounded multimodal reasoning and evidence selection for slide-based visual QA, outputting both the selected evidence chain and the final answer.


CoE-SlideVQA-8B: Vision-Language Model for Slide-Based QA

CoE-SlideVQA-8B is an 8 billion parameter vision-language checkpoint developed by PeiyangLiu and fine-tuned for Chain-of-Evidence (CoE) question answering over presentation slide screenshots. Given a natural-language question and a set of slide images, it selects the visual evidence relevant to the question and uses that evidence to formulate an answer.

Key Capabilities

  • Visual Question Answering (VQA) on Slides: Answers questions by analyzing content within presentation slides.
  • Evidence Selection: Identifies and localizes specific visual evidence within slide screenshots that supports the generated answer.
  • Grounded Multimodal Reasoning: Connects textual questions with visual information to provide contextually relevant responses.
  • Structured Output: Produces a JSON-style response including an evidence_chain (selected supporting slides and localized evidence) and the answer.
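As a rough illustration of consuming the structured output described above, the sketch below pulls the first JSON object out of raw model text and checks for the two documented keys, evidence_chain and answer. The function name parse_coe_response, the sample string, and the inner fields slide and bbox are assumptions made for illustration; the actual output schema is defined by the project code and may differ.

```python
import json
from typing import Any


def parse_coe_response(raw: str) -> dict[str, Any]:
    """Extract the first JSON object from raw model text and check its keys.

    Hypothetical helper: only the top-level keys "evidence_chain" and
    "answer" come from the model card; everything else is an assumption.
    """
    start = raw.find("{")
    if start == -1:
        raise ValueError("no JSON object found in model output")
    # raw_decode parses one JSON value and ignores any trailing text.
    obj, _ = json.JSONDecoder().raw_decode(raw[start:])
    if not isinstance(obj, dict):
        raise ValueError("expected a JSON object at the top level")
    missing = {"evidence_chain", "answer"} - obj.keys()
    if missing:
        raise ValueError(f"response is missing keys: {sorted(missing)}")
    return obj


# Made-up sample output; the "slide" and "bbox" fields are illustrative only.
sample = (
    'Output: {"evidence_chain": [{"slide": 2, "bbox": [10, 20, 200, 120]}], '
    '"answer": "42%"}'
)
result = parse_coe_response(sample)
print(result["answer"])
print(len(result["evidence_chain"]))
```

Validating the keys before use keeps downstream code from silently accepting a free-text reply when the model does not follow the expected format.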

Good For

  • Research and Prototyping: Suited to exploring slide-based visual QA, evidence selection, and multimodal reasoning tasks.
  • Analyzing Presentations: Can be used to extract information or answer specific queries directly from slide screenshots.
  • Developing Intelligent Assistants: Can serve as a component in systems that need visual understanding of presentation content.

For detailed prompt formatting and evaluation, refer to the project code. The model was trained using the Wiki-CoE dataset.