PeiyangLiu/CoE-Wiki-CoE-8B
PeiyangLiu/CoE-Wiki-CoE-8B is an 8 billion parameter vision-language checkpoint developed by PeiyangLiu, fine-tuned for Chain-of-Evidence (CoE) question answering on the Wiki-CoE dataset. This model excels at processing natural language questions and candidate screenshot images to produce structured answers with an evidence chain. It is specifically designed for research in multimodal QA, visual evidence selection, and evidence-grounded reasoning over document-like screenshots, supporting a context length of 32768 tokens.
Loading preview...
CoE-Wiki-CoE-8B: Vision-Language Model for Chain-of-Evidence QA
CoE-Wiki-CoE-8B is an 8 billion parameter vision-language model developed by PeiyangLiu, specifically fine-tuned for Chain-of-Evidence (CoE) question answering. This model's primary function is to process a natural-language question alongside candidate screenshot images and generate a structured answer that includes an explicit evidence chain.
Key Capabilities
- Multimodal Question Answering: Integrates natural language questions with visual information from screenshots.
- Evidence Selection: Identifies and localizes supporting evidence within candidate screenshots.
- Structured Output: Produces a JSON-style response containing the
evidence_chain(selected screenshots and localized evidence) and theanswer. - Research Focus: Intended for research in multimodal QA, visual evidence selection, and evidence-grounded reasoning over document-like documents.
Training and Usage
The model was fine-tuned on the Wiki-CoE dataset. Developers can utilize the transformers library for inference, loading the model and processor with AutoProcessor and AutoModelForImageTextToText. For reproducible results, it's recommended to use the same image preprocessing and prompt format as detailed in the CoE repository.