Jellyfish-7B: Specialized Data Preprocessing LLM
Jellyfish-7B is a 7-billion-parameter large language model developed by Haochen Zhang, Yuyang Dong, Chuan Xiao, and Masafumi Oyamada, funded by NEC Corporation and Osaka University. It is fine-tuned from the mistralai/Mistral-7B-Instruct-v0.2 base model on a subset of the Jellyfish-Instruct dataset.
Key Capabilities & Performance
This model is specifically designed and optimized for various data preprocessing tasks, demonstrating competitive performance against larger models like GPT-3.5 and GPT-4 in its specialized domain. Key capabilities include:
- Error Detection: Identifying errors in record rows or specific attribute values.
- Data Imputation: Inferring missing attribute values based on available record information.
- Schema Matching: Determining semantic equivalence between attributes from different tables.
- Entity Matching: Identifying whether two records represent the same entity.
- Column Type Annotation: Assigning a semantic type (e.g., name, location, date) to each column of a table.
- Attribute Value Extraction: Extracting the values of specified attributes from unstructured text.
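To illustrate how such tasks are typically posed to an instruction-tuned model, the sketch below serializes two records into an entity-matching prompt. The template wording and serialization format are hypothetical, chosen for illustration; the exact instruction phrasing used in the Jellyfish-Instruct data may differ.

```python
# Hypothetical prompt builder for the Entity Matching task described above.
# The instruction wording is an assumption for illustration only; it is not
# the exact template from the Jellyfish-Instruct dataset.

def serialize(record: dict) -> str:
    """Render a record as 'attribute: value' pairs on one line."""
    return "; ".join(f"{k}: {v}" for k, v in record.items())

def entity_matching_prompt(record_a: dict, record_b: dict) -> str:
    return (
        "You are tasked with determining whether two records refer to "
        "the same real-world entity.\n"
        f"Record A: [{serialize(record_a)}]\n"
        f"Record B: [{serialize(record_b)}]\n"
        "Answer with 'Yes' or 'No'."
    )

prompt = entity_matching_prompt(
    {"name": "Fodor's", "city": "Los Angeles"},
    {"name": "Fodors", "city": "LA"},
)
print(prompt)
```

The other tasks follow the same pattern: a task instruction, a serialized record or table fragment, and a constrained answer format that makes the model's output easy to parse.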
Jellyfish-7B achieves a 56.36% winning rate against GPT-3.5-turbo (as evaluated by GPT-4) and shows strong benchmark results. For instance, in Entity Matching it scores 100% on the Beer and Fodors-Zagats datasets and 98.65% on DBLP-ACM. Its larger counterpart, Jellyfish-13B, often surpasses GPT-3.5 and sometimes GPT-4 on specific data preprocessing tasks, as detailed in the Jellyfish paper.
Training & Usage
The model was fine-tuned using LoRA, targeting the q_proj, k_proj, v_proj, and o_proj modules. It is released under the non-commercial Creative Commons license CC BY-NC-4.0. For accelerated inference, the developers strongly recommend using vLLM.
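To make the LoRA detail concrete, the dependency-free sketch below shows the low-rank update LoRA applies to one frozen projection weight (a stand-in for q_proj): the adapted output is W x plus a scaled product of two small trainable matrices B and A, following the standard formulation W' = W + (alpha/r)·BA. The matrix sizes, rank, and values here are toy numbers for illustration, not the actual Jellyfish training configuration.

```python
# Minimal sketch of a LoRA update on one frozen projection (e.g. q_proj).
# W stays fixed; only the low-rank factors A (r x d_in) and B (d_out x r)
# would be trained. Toy dimensions, pure Python, no dependencies.

def matvec(M, x):
    """Multiply matrix M (list of rows) by vector x."""
    return [sum(m * v for m, v in zip(row, x)) for row in M]

def lora_forward(x, W, A, B, alpha, r):
    # y = W x + (alpha / r) * B (A x) -- the standard LoRA adaptation
    base = matvec(W, x)
    delta = matvec(B, matvec(A, x))
    scale = alpha / r
    return [b + scale * d for b, d in zip(base, delta)]

# Toy 2x2 frozen weight and a rank-1 adapter (r = 1).
W = [[1.0, 0.0], [0.0, 1.0]]   # frozen projection weight
A = [[1.0, 1.0]]               # 1 x 2 down-projection
B = [[0.5], [0.0]]             # 2 x 1 up-projection

x = [2.0, 3.0]
y = lora_forward(x, W, A, B, alpha=1.0, r=1)
print(y)
```

Because B is conventionally initialized to zero, the adapted model starts out identical to the base model and only diverges as the A/B factors are trained; restricting the update to the four attention projections keeps the number of trainable parameters small.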