MeXtract-3B: Specialized Metadata Extraction

MeXtract-3B, developed by IVUL at KAUST, is a 3.1-billion-parameter model fine-tuned from Qwen2.5 3B Instruct. It extracts structured metadata from scientific papers, with the attributes to extract defined by a schema. The model was trained on a synthetically generated dataset.

Key Capabilities

Schema-based Extraction: Each metadata attribute is defined in a schema with a type, minimum and maximum lengths, and allowed options, giving precise control over the extracted values.
Lightweight Architecture: At 3.1B parameters, the model is small enough for efficient deployment.
High Accuracy: Achieves an average score of 73.23 on the MOLE+ benchmark, compared with 57.16 for its base model, Qwen2.5 3B Instruct, and also outperforms other 3B-4B alternatives.
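
To make the schema idea above concrete, here is a minimal sketch in plain Python. It is not the model's actual schema API: the Attribute class, the validate function, and the example attribute names are all hypothetical, and treating "length" as the length of a string or list is an assumption. It only illustrates how a type, min/max lengths, and options can constrain an extracted record.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Attribute:
    """One metadata attribute in a hypothetical schema (not the MeXtract API)."""
    name: str
    type: type
    min_length: int = 0
    max_length: Optional[int] = None
    options: Optional[list] = None


def validate(schema, record):
    """Return a list of problems; an empty list means the record conforms."""
    problems = []
    for attr in schema:
        if attr.name not in record:
            problems.append(f"{attr.name}: missing")
            continue
        value = record[attr.name]
        if not isinstance(value, attr.type):
            problems.append(f"{attr.name}: expected {attr.type.__name__}")
            continue
        # Assumption: length limits apply to len() of strings and lists.
        if isinstance(value, (str, list)):
            if len(value) < attr.min_length:
                problems.append(f"{attr.name}: shorter than {attr.min_length}")
            if attr.max_length is not None and len(value) > attr.max_length:
                problems.append(f"{attr.name}: longer than {attr.max_length}")
        if attr.options is not None:
            values = value if isinstance(value, list) else [value]
            bad = [v for v in values if v not in attr.options]
            if bad:
                problems.append(f"{attr.name}: not in options: {bad}")
    return problems


# Hypothetical schema for a paper; names and limits are illustrative only.
PAPER_SCHEMA = [
    Attribute("title", str, min_length=1, max_length=300),
    Attribute("keywords", list, min_length=1, max_length=5),
    Attribute("access", str, options=["open", "closed"]),
]

good_record = {"title": "A Study", "keywords": ["nlp"], "access": "open"}
bad_record = {"title": "", "keywords": list("abcdef"), "access": "paid"}

print(validate(PAPER_SCHEMA, good_record))
print(validate(PAPER_SCHEMA, bad_record))
```

A check like this can run on each extracted record before it is stored, so that values outside the schema are flagged rather than silently accepted.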

Good for

Automated Metadata Retrieval: Ideal for extracting specific information (e.g., author names, affiliations, keywords) from large corpora of scientific documents.
Structured Data Generation: Useful for converting unstructured text from papers into structured, queryable data formats.
Research and Academic Applications: Enhances tools for literature review, citation management, and knowledge graph construction.
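
As a rough illustration of the "structured, queryable data" use case above, the sketch below loads extracted records into an in-memory SQLite table and queries them. It assumes the extraction step yields one JSON object per paper; the raw_outputs strings, the field names, and the load_records helper are invented for the example and do not come from the model card.

```python
import json
import sqlite3

# Hypothetical extraction outputs, one JSON object per paper (illustrative only).
raw_outputs = [
    '{"title": "Paper A", "year": 2021, "keywords": ["nlp", "datasets"]}',
    '{"title": "Paper B", "year": 2023, "keywords": ["vision"]}',
]


def load_records(outputs):
    """Parse JSON strings and store them in an in-memory SQLite table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE papers (title TEXT, year INTEGER, keywords TEXT)")
    for text in outputs:
        rec = json.loads(text)
        conn.execute(
            "INSERT INTO papers VALUES (?, ?, ?)",
            # Lists are stored as JSON text because SQLite has no array type.
            (rec["title"], rec["year"], json.dumps(rec["keywords"])),
        )
    conn.commit()
    return conn


conn = load_records(raw_outputs)
recent = conn.execute(
    "SELECT title FROM papers WHERE year >= ? ORDER BY title", (2022,)
).fetchall()
print(recent)
```

The same pattern extends to any relational store or dataframe library; SQLite is used here only because it ships with Python.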

Note: MeXtract-3B is optimized for metadata extraction and may not perform well on general NLP tasks.
