songjhPKU/RxnCaption-VL
RxnCaption-VL by songjhPKU is a 7-billion-parameter vision-language model fine-tuned from Qwen2.5-VL-7B-Instruct for parsing chemical reaction diagrams. It takes images annotated with Bounding-box Index Visual Prompts (BIVP) and outputs structured JSON descriptions of the reactions, extracting reactants, conditions, and products from complex diagrams.
RxnCaption-VL: Chemical Reaction Diagram Parsing
RxnCaption-VL, developed by songjhPKU, is a specialized 7-billion-parameter vision-language model built on Qwen2.5-VL-7B-Instruct. Its core function is to parse chemical reaction diagrams, converting visual information into structured JSON output.
Key Capabilities
- Visual Prompt Guided Captioning: The model takes images annotated with Bounding-box Index Visual Prompts (BIVP), where bounding boxes and numeric labels highlight structures and text within the diagram.
- Structured Output: It generates a JSON list for each reaction, detailing 'reactants', 'conditions', and 'products', with each element referencing either a structure index or extracted text.
- Chemistry Expertise: Fine-tuned on the U-RxnDiagram-15k dataset (approximately 59,000 augmented samples), it demonstrates proficiency in interpreting complex chemical schematics.
- Performance: Achieves Hard F1 / Soft F1 scores of 75.5 / 88.2 on the RxnScribe-test benchmark and 55.5 / 67.6 on U-RxnDiagram-15k-test.
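The structured output described above can be consumed with a small validation step. The following is a minimal sketch, not the model's official post-processing: it assumes the model's text response is a JSON list of objects, each with 'reactants', 'conditions', and 'products' lists whose elements are either integer structure indices or text strings. The exact schema, the `parse_reactions` helper, and the sample string are illustrative assumptions.

```python
import json

# Keys named in the model description; the surrounding schema is an assumption.
REQUIRED_KEYS = ("reactants", "conditions", "products")


def parse_reactions(raw: str) -> list:
    """Parse a raw model response into a list of validated reaction dicts."""
    data = json.loads(raw)
    if not isinstance(data, list):
        raise ValueError("expected a JSON list of reactions")

    reactions = []
    for i, rxn in enumerate(data):
        if not isinstance(rxn, dict):
            raise ValueError(f"reaction {i} is not a JSON object")
        missing = [k for k in REQUIRED_KEYS if k not in rxn]
        if missing:
            raise ValueError(f"reaction {i} is missing keys: {missing}")
        for key in REQUIRED_KEYS:
            if not isinstance(rxn[key], list):
                raise ValueError(f"reaction {i}: '{key}' is not a list")
        # Keep only the three expected fields, in a fixed order.
        reactions.append({k: rxn[k] for k in REQUIRED_KEYS})
    return reactions


# Hypothetical response: integers stand for BIVP structure indices,
# strings for text extracted from the diagram.
sample = '[{"reactants": [1, 2], "conditions": ["Pd/C, H2"], "products": [3]}]'
reactions = parse_reactions(sample)
```

Validating the shape up front makes malformed generations fail loudly instead of propagating into downstream chemistry pipelines.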
Use Cases
RxnCaption-VL is suited to automating the extraction of chemical reaction information from diagrams, supporting applications in chemical research, patent analysis, and digital chemistry databases. It converts visual chemical data into machine-readable form programmatically, which reduces manual transcription in chemistry-focused workflows.
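As one illustration of the machine-readable conversion, parsed reactions could be flattened into rows for a database or spreadsheet. This is a sketch under stated assumptions: the input is a list of dicts with 'reactants', 'conditions', and 'products' lists, and the `reactions_to_csv` helper and its column names are hypothetical rather than part of the model's tooling.

```python
import csv
import io


def reactions_to_csv(reactions: list) -> str:
    """Flatten reaction dicts into CSV text with one row per component."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    # Column names are illustrative, not defined by the model.
    writer.writerow(["reaction_id", "role", "value"])
    for reaction_id, rxn in enumerate(reactions):
        for role in ("reactants", "conditions", "products"):
            for value in rxn.get(role, []):
                writer.writerow([reaction_id, role, value])
    return buf.getvalue()


# Hypothetical parsed output: integers are structure indices, strings are text.
example = [{"reactants": [1, 2], "conditions": ["Pd/C, H2"], "products": [3]}]
csv_text = reactions_to_csv(example)
```

The long format (one row per reactant, condition, or product) keeps reactions with differing numbers of components in a single uniform table.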