PeiyangLiu/CoE-SlideVQA-8B
PeiyangLiu/CoE-SlideVQA-8B is an 8-billion-parameter vision-language model fine-tuned for Chain-of-Evidence question answering over presentation slide screenshots. Developed by PeiyangLiu, it answers natural-language questions by identifying and localizing visual evidence within the provided slide images. It specializes in grounded multimodal reasoning and evidence selection for slide-based visual QA, and it outputs both the selected evidence chain and the final answer.
CoE-SlideVQA-8B: Vision-Language Model for Slide-Based QA
CoE-SlideVQA-8B is an 8-billion-parameter vision-language checkpoint developed by PeiyangLiu and fine-tuned for Chain-of-Evidence (CoE) question answering over presentation slide screenshots. Given a natural-language question and a set of slide images, the model extracts the relevant visual evidence from the slides and uses it to formulate an answer.
Key Capabilities
- Visual Question Answering (VQA) on Slides: Answers questions by analyzing content within presentation slides.
- Evidence Selection: Identifies and localizes specific visual evidence within slide screenshots that supports the generated answer.
- Grounded Multimodal Reasoning: Connects textual questions with visual information to provide contextually relevant responses.
- Structured Output: Produces a JSON-style response including an evidence_chain (selected supporting slides and localized evidence) and the answer.
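The structured output described above can be consumed with a small parser. The following is a minimal sketch, not the project's own post-processing code: it assumes the response contains a JSON object with top-level evidence_chain and answer keys, as the description suggests. The helper name parse_coe_response and the inner fields shown in the sample (slide, bbox) are hypothetical; the exact schema is defined by the project code.

```python
import json
from typing import Any


def parse_coe_response(raw: str) -> tuple[list[Any], str]:
    """Extract (evidence_chain, answer) from a JSON-style model response.

    Hypothetical helper: only the top-level key names come from the
    model card; everything else here is an assumption.
    """
    # The response may carry extra text around the JSON, so decode the
    # first JSON object that starts at the first opening brace.
    start = raw.find("{")
    if start == -1:
        raise ValueError("no JSON object found in response")
    obj, _ = json.JSONDecoder().raw_decode(raw[start:])

    missing = [key for key in ("evidence_chain", "answer") if key not in obj]
    if missing:
        raise ValueError(f"response is missing keys: {missing}")
    return obj["evidence_chain"], obj["answer"]


if __name__ == "__main__":
    # Sample response; the "slide" and "bbox" fields are illustrative only.
    sample = (
        'Result: {"evidence_chain": [{"slide": 2, "bbox": [10, 20, 200, 80]}], '
        '"answer": "42%"}'
    )
    chain, answer = parse_coe_response(sample)
    print(answer)
    for step in chain:
        print(step)
```

Validating the two keys up front makes malformed generations fail loudly instead of propagating partial results into downstream evaluation.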
Good For
- Research and Prototyping: Ideal for exploring slide-based visual QA, evidence selection, and multimodal reasoning tasks.
- Analyzing Presentations: Can be used to extract information or answer specific queries directly from presentation visuals.
- Developing Intelligent Assistants: Forms a core component for systems requiring visual understanding of presentation content.
For detailed prompt formatting and evaluation, refer to the project code. The model was trained using the Wiki-CoE dataset.