NECOUDBFM/Jellyfish-8B
  • Task: Text generation
  • Model size: 8B
  • Quantization: FP8
  • Context length: 8k
  • Published: Apr 23, 2024
  • License: cc-by-nc-4.0
  • Architecture: Transformer

Jellyfish-8B is an 8 billion parameter large language model developed by Haochen Zhang, Yuyang Dong, Chuan Xiao, and Masafumi Oyamada, funded by NEC Corporation and Osaka University. Fine-tuned from Meta-Llama-3-8B-Instruct, this model specializes in data preprocessing tasks such as error detection, data imputation, schema matching, and entity matching. It demonstrates strong performance across various data cleaning benchmarks, often outperforming or competing with larger models like GPT-3.5 and GPT-4 on specific tasks.


Overview

NECOUDBFM/Jellyfish-8B is an 8 billion parameter large language model, fine-tuned from Meta-Llama-3-8B-Instruct by Haochen Zhang, Yuyang Dong, Chuan Xiao, and Masafumi Oyamada. Developed with funding from NEC Corporation and Osaka University, this model is specifically designed for data preprocessing tasks. It is part of a family of Jellyfish models, with other sizes including Jellyfish-7B and Jellyfish-13B.

Key Capabilities

  • Error Detection: Identifies errors in specific attribute values within records, including spelling errors, inconsistencies, or illogical values.
  • Data Imputation: Infers missing attribute values based on available information within a record.
  • Schema Matching: Determines semantic equivalence between two attributes (columns) for table merging.
  • Entity Matching: Compares two records to determine if they represent the same entity.
  • Column Type Annotation & Attribute Value Extraction: Also performs well on these unseen tasks, as detailed in the benchmarks.
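To make the entity-matching task above concrete, the sketch below builds a yes/no prompt from two records. The template wording and the `build_entity_matching_prompt` helper are illustrative assumptions, not Jellyfish's official prompt format.

```python
def build_entity_matching_prompt(record_a: dict, record_b: dict) -> str:
    """Build an illustrative entity-matching prompt for two records.

    Note: this wording is a hypothetical template, not the prompt
    format Jellyfish was trained on.
    """
    def render(record: dict) -> str:
        # Serialize a record as "attribute: value" lines.
        return "\n".join(f"{k}: {v}" for k, v in record.items())

    return (
        "You are tasked with determining whether the two records below "
        "refer to the same real-world entity.\n\n"
        f"Record A:\n{render(record_a)}\n\n"
        f"Record B:\n{render(record_b)}\n\n"
        "Answer with 'Yes' or 'No'."
    )


prompt = build_entity_matching_prompt(
    {"title": "iPhone 13 128GB", "brand": "Apple"},
    {"title": "Apple iPhone 13 (128 GB)", "brand": "Apple"},
)
print(prompt)
```

The same record-serialization pattern extends naturally to the other tasks (e.g., listing one record and asking whether a given attribute value contains an error).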

Performance Highlights

The Jellyfish-8B model shows competitive performance against models like GPT-3.5 and GPT-4 on various data preprocessing benchmarks. For instance, on Entity Matching tasks it achieves 81.42% F1 on Amazon-Google and 100% F1 on Beer. While its performance varies across tasks, it often provides strong results, particularly on seen tasks such as data imputation and entity matching. The model was trained using LoRA, targeting the q_proj, k_proj, v_proj, and o_proj modules for efficient fine-tuning.
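The LoRA setup described above can be sketched with the `peft` library. Only the target modules come from the model card; the rank, alpha, and dropout values here are illustrative placeholders, not the authors' actual training hyperparameters.

```python
from peft import LoraConfig

# Sketch of a LoRA configuration targeting the attention projection
# modules named in the model card. r, lora_alpha, and lora_dropout
# are placeholder values, not the reported training settings.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
```

Restricting adaptation to these projection matrices keeps the number of trainable parameters small relative to full fine-tuning of an 8B model.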

When to Use

Jellyfish-8B is well suited to applications requiring automated data cleaning and preparation, especially for structured and semi-structured data. Its specialized fine-tuning makes it a strong candidate for tasks like ensuring data quality, integrating datasets, and preparing data for further analysis or machine learning. The authors recommend running Jellyfish with vLLM for accelerated inference.
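A minimal vLLM inference sketch is shown below. The prompt is an illustrative data-imputation question, and sampling settings are placeholders; running it requires a GPU with the model weights available.

```python
from vllm import LLM, SamplingParams

# Load Jellyfish-8B with vLLM for accelerated, batched inference.
llm = LLM(model="NECOUDBFM/Jellyfish-8B")

# Greedy decoding keeps short classification-style answers deterministic.
params = SamplingParams(temperature=0.0, max_tokens=16)

prompt = (
    "Is there an error in the attribute 'city' of the record "
    "{name: 'Blue Bottle Coffee', city: 'Oaklnd'}? Answer Yes or No."
)
outputs = llm.generate([prompt], params)
print(outputs[0].outputs[0].text)
```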