Jellyfish-13B: Specialized for Data Preprocessing
Jellyfish-13B is a 13-billion-parameter large language model developed by Haochen Zhang, Yuyang Dong, Chuan Xiao, and Masafumi Oyamada, fine-tuned from Open-Orca/OpenOrca-Platypus2-13B. It specializes in data preprocessing tasks, delivering performance competitive with state-of-the-art algorithms and with larger LLMs such as GPT-3.5 and GPT-4.
Key Capabilities
- Data Preprocessing: Excels in error detection, data imputation, schema matching, and entity matching.
- Cost-Effective: At 13B parameters, it is small enough to run locally, which keeps sensitive data in-house rather than sending it to an external API.
- Dual Versions: Available in two distinct versions:
  - Jellyfish-13B (main branch): Designed for precise, straightforward answers, ideal for integration into data management systems where responses can be easily transformed into code.
  - Jellyfish-13B-Interpreter (alternative branch): Fine-tuned with reasoning steps distilled from GPT-4's sequential thought processes, making it more user-oriented and able to explain its data insights in depth.
- Strong NLP Performance: Maintains robust performance in general NLP tasks, as evidenced by benchmark comparisons.
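Tasks such as entity matching are posed to the model as natural-language prompts over serialized records. The sketch below shows one plausible way to build such a prompt; the instruction wording and the `build_entity_matching_prompt` helper are illustrative assumptions, not the model's official prompt template, so consult the model card for the exact format.

```python
# Sketch: framing entity matching as a prompt for Jellyfish-13B.
# The instruction wording here is an assumption, not the official template.

def build_entity_matching_prompt(record_a: dict, record_b: dict) -> str:
    """Serialize two records and ask whether they denote the same entity."""
    def fmt(record: dict) -> str:
        # Render a record as "key: value" pairs, e.g. "name: iPhone, brand: Apple".
        return ", ".join(f"{key}: {value}" for key, value in record.items())

    return (
        "You are tasked with determining whether two records refer to "
        "the same real-world entity.\n"
        f"Record A: [{fmt(record_a)}]\n"
        f"Record B: [{fmt(record_b)}]\n"
        "Are Record A and Record B the same entity? Answer Yes or No."
    )

prompt = build_entity_matching_prompt(
    {"name": "iPhone 13 Pro", "brand": "Apple"},
    {"name": "Apple iPhone 13 Pro", "brand": "Apple"},
)
```

Because the main-branch model answers with a terse Yes/No, the response maps directly onto a boolean inside a data-cleaning pipeline.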
Performance Highlights
Jellyfish-13B demonstrates strong performance across various data preprocessing tasks, often rivaling or surpassing larger models and specialized algorithms. For instance, it achieved 99.33% on Error Detection (Adult dataset) and 100% on Data Imputation (Buy dataset). On average, it achieved 86.02% across the seen data preprocessing tasks (those represented in its fine-tuning data), outperforming GPT-4's 84.17%.
Good for
- Data Management Systems: Jellyfish-13B's precise responses are well-suited for automated data cleaning and preparation pipelines.
- Data Analysts & Scientists: Jellyfish-13B-Interpreter provides detailed insights for users without advanced coding skills.
- Local Deployment: Its 13B size allows for efficient local execution, addressing data security and cost concerns.
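A minimal local-inference sketch using Hugging Face `transformers` is shown below. The `MODEL_ID` string and the `generate_locally` helper are assumptions for illustration; check the model's Hugging Face card for the exact repository id, and note that 13B-parameter weights require substantial GPU or CPU memory.

```python
# Sketch: running Jellyfish-13B locally with Hugging Face transformers.
# MODEL_ID is an assumed repository id -- verify it against the model card.
MODEL_ID = "NECOUDBFM/Jellyfish-13B"

def generate_locally(prompt: str, max_new_tokens: int = 64) -> str:
    """Run one prompt through the locally loaded model and return its answer.

    Imports are deferred so this module can be inspected even where
    transformers is not installed.
    """
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    # device_map="auto" spreads layers across available GPUs (or CPU).
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID, device_map="auto")

    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output = model.generate(**inputs, max_new_tokens=max_new_tokens)
    # Decode only the newly generated tokens, not the echoed prompt.
    return tokenizer.decode(
        output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
    )
```

Keeping both loading and generation behind one function keeps the example self-contained; a production pipeline would load the model once and reuse it across prompts.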
For more details, refer to the Jellyfish paper.