MeXtract-3B: Specialized Metadata Extraction
MeXtract-3B, developed by IVUL at KAUST, is a 3.1-billion-parameter model fine-tuned from Qwen2.5 3B Instruct. It extracts structured metadata from scientific papers, with each attribute to be extracted defined in a schema. The model was trained on a synthetically generated dataset.
Key Capabilities
- Schema-based Extraction: Each metadata attribute is defined with a type, minimum and maximum lengths, and allowed options, giving precise control over the output.
- Lightweight Architecture: At 3.1B parameters, the model is small enough for efficient deployment.
- High Accuracy: Achieves an average score of 73.23 on the MOLE+ benchmark, about 16 points above its base model, Qwen2.5 3B Instruct (57.16), and ahead of other 3B-4B alternatives.
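To make the schema idea concrete, here is a minimal sketch of what a schema with types, length bounds, and options might look like, together with a check that an extracted record respects it. The key names (`type`, `min_len`, `max_len`, `options`), the example attributes, and the `validate` helper are illustrative assumptions, not MeXtract's documented schema format. Length is assumed to mean characters for strings and items for lists.

```python
# Hypothetical schema: key names and attributes are assumptions for illustration,
# not the format MeXtract-3B actually expects.
SCHEMA = {
    "title": {"type": "str", "min_len": 1, "max_len": 300},
    "keywords": {"type": "list", "min_len": 1, "max_len": 10},
    "license": {
        "type": "str",
        "min_len": 1,
        "max_len": 20,
        "options": ["MIT", "Apache-2.0", "unknown"],
    },
}

# Map schema type names to Python types.
PY_TYPES = {"str": str, "list": list}


def validate(record, schema):
    """Return a list of human-readable violations (empty if the record conforms)."""
    errors = []
    for name, spec in schema.items():
        if name not in record:
            errors.append(f"{name}: missing")
            continue
        value = record[name]
        if not isinstance(value, PY_TYPES[spec["type"]]):
            errors.append(f"{name}: expected {spec['type']}")
            continue
        # Length means characters for strings and items for lists (an assumption).
        if not spec["min_len"] <= len(value) <= spec["max_len"]:
            errors.append(
                f"{name}: length {len(value)} outside "
                f"[{spec['min_len']}, {spec['max_len']}]"
            )
        options = spec.get("options")
        if options is not None:
            items = value if isinstance(value, list) else [value]
            bad = [item for item in items if item not in options]
            if bad:
                errors.append(f"{name}: {bad!r} not in allowed options")
    return errors
```

A check like this could run on the model's output before it is stored, so that records violating the schema are flagged rather than silently accepted.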
Good for
- Automated Metadata Retrieval: Ideal for extracting specific information (e.g., author names, affiliations, keywords) from large corpora of scientific documents.
- Structured Data Generation: Useful for converting unstructured text from papers into structured, queryable data formats.
- Research and Academic Applications: Enhances tools for literature review, citation management, and knowledge graph construction.
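As one way to picture the "structured, queryable" use case above, the sketch below loads extracted records into an in-memory SQLite database and queries them by keyword. The records are placeholder values standing in for extraction output, not real model results. The table layout and the helpers `load_records` and `papers_with_keyword` are hypothetical.

```python
import sqlite3

# Placeholder records standing in for extraction output (not real model output).
records = [
    {"paper_id": "p1", "title": "Example Paper One",
     "keywords": ["metadata", "extraction"]},
    {"paper_id": "p2", "title": "Example Paper Two",
     "keywords": ["benchmark"]},
]


def load_records(conn, records):
    """Store each record in a papers table plus a one-row-per-keyword table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS papers (paper_id TEXT PRIMARY KEY, title TEXT)"
    )
    conn.execute(
        "CREATE TABLE IF NOT EXISTS keywords (paper_id TEXT, keyword TEXT)"
    )
    for rec in records:
        conn.execute(
            "INSERT INTO papers VALUES (?, ?)", (rec["paper_id"], rec["title"])
        )
        conn.executemany(
            "INSERT INTO keywords VALUES (?, ?)",
            [(rec["paper_id"], kw) for kw in rec["keywords"]],
        )
    conn.commit()


def papers_with_keyword(conn, keyword):
    """Return titles of papers tagged with the given keyword, sorted by title."""
    rows = conn.execute(
        "SELECT p.title FROM papers p "
        "JOIN keywords k ON p.paper_id = k.paper_id "
        "WHERE k.keyword = ? ORDER BY p.title",
        (keyword,),
    )
    return [title for (title,) in rows]


conn = sqlite3.connect(":memory:")
load_records(conn, records)
```

Splitting list-valued attributes such as keywords into their own table keeps them searchable with plain SQL. A document store or a knowledge graph would be equally reasonable targets.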
Note: MeXtract-3B is optimized for metadata extraction and may not perform well on general NLP tasks.