NECOUDBFM/Jellyfish-7B

Text Generation · Model size: 7B · Quantization: FP8 · Context length: 4K · License: CC BY-NC 4.0 · Architecture: Transformer · Open weights

NECOUDBFM/Jellyfish-7B is a 7-billion-parameter large language model developed by Haochen Zhang, Yuyang Dong, Chuan Xiao, and Masafumi Oyamada, fine-tuned from Mistral-7B-Instruct-v0.2. The model specializes in data preprocessing tasks such as error detection, data imputation, schema matching, and entity matching. It achieves a 56.36% win rate against GPT-3.5-turbo (as judged by GPT-4) and performs strongly across a range of seen and unseen data preprocessing benchmarks.


Jellyfish-7B: Specialized Data Preprocessing LLM

Jellyfish-7B is a 7-billion-parameter large language model developed by Haochen Zhang, Yuyang Dong, Chuan Xiao, and Masafumi Oyamada, funded by NEC Corporation and Osaka University. It is fine-tuned from the mistralai/Mistral-7B-Instruct-v0.2 base model on a subset of the Jellyfish-Instruct dataset.

Key Capabilities & Performance

This model is specifically designed and optimized for various data preprocessing tasks, demonstrating competitive performance against larger models like GPT-3.5 and GPT-4 in its specialized domain. Key capabilities include:

  • Error Detection: Identifying errors in record rows or specific attribute values.
  • Data Imputation: Inferring missing attribute values based on available record information.
  • Schema Matching: Determining semantic equivalence between attributes from different tables.
  • Entity Matching: Identifying whether two records represent the same entity.
  • Column Type Annotation: Assigning a semantic type to each table column.
  • Attribute Value Extraction: Extracting the values of target attributes from text.
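To illustrate how such tasks are typically posed to an instruction-tuned model, here is a minimal sketch that serializes two records into an entity-matching question. The instruction wording is hypothetical, not the authors' official prompt template; consult the Jellyfish repository for the exact prompts used in training.

```python
def build_entity_matching_prompt(record_a: dict, record_b: dict) -> str:
    """Serialize two records into a yes/no entity-matching question.

    The phrasing below is illustrative only and is NOT the official
    Jellyfish prompt format.
    """
    def serialize(record: dict) -> str:
        # Flatten a record into "attribute: value" pairs.
        return ", ".join(f"{k}: {v}" for k, v in record.items())

    return (
        "You are tasked with entity matching.\n"
        f"Record A: [{serialize(record_a)}]\n"
        f"Record B: [{serialize(record_b)}]\n"
        "Do Record A and Record B refer to the same entity? Answer yes or no."
    )

prompt = build_entity_matching_prompt(
    {"name": "Fodor's", "city": "Los Angeles"},
    {"name": "Fodors", "city": "LA"},
)
print(prompt)
```

The same pattern (serialize the record, state the task, constrain the answer format) applies to error detection, imputation, and schema matching prompts.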

Jellyfish-7B achieves a 56.36% winning rate against GPT-3.5-turbo (evaluated by GPT-4) and shows strong benchmark results. For instance, in Entity Matching, it achieves 100% on Beer and Fodors-Zagats datasets, and 98.65% on DBLP-ACM. Its larger counterpart, Jellyfish-13B, often surpasses GPT-3.5 and sometimes GPT-4 on specific data preprocessing tasks, as detailed in the Jellyfish paper.

Training & Usage

The model was fine-tuned with LoRA, targeting the q_proj, k_proj, v_proj, and o_proj projection modules. It is released under the non-commercial CC BY-NC 4.0 license. For accelerated inference, the developers strongly recommend vLLM.
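A minimal vLLM inference sketch might look like the following. The `[INST] ... [/INST]` wrapping reflects the Mistral-7B-Instruct-v0.2 chat format of the base model, and the sampling settings are assumptions rather than the authors' recommended values:

```python
def wrap_mistral_instruct(instruction: str) -> str:
    # Mistral-7B-Instruct-v0.2 chat format: user turns are wrapped in
    # [INST] ... [/INST]. The tokenizer typically prepends the BOS
    # token itself, so it is not included here.
    return f"[INST] {instruction} [/INST]"

def generate_with_vllm(prompts: list[str]) -> list[str]:
    # Requires `pip install vllm` and a GPU. Imported lazily so the
    # prompt helper above remains usable without vLLM installed.
    from vllm import LLM, SamplingParams

    llm = LLM(model="NECOUDBFM/Jellyfish-7B")
    params = SamplingParams(temperature=0.0, max_tokens=256)
    outputs = llm.generate([wrap_mistral_instruct(p) for p in prompts], params)
    return [out.outputs[0].text for out in outputs]
```

Greedy decoding (`temperature=0.0`) is used here because data preprocessing answers are short and deterministic; adjust as needed.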