Jellyfish-13B: Specialized for Data Preprocessing
Jellyfish-13B is a 13-billion-parameter large language model developed by Haochen Zhang, Yuyang Dong, Chuan Xiao, and Masafumi Oyamada, fine-tuned from Open-Orca/OpenOrca-Platypus2-13B. It specializes in data preprocessing tasks, delivering performance competitive with state-of-the-art algorithms and with larger LLMs such as GPT-3.5 and GPT-4.
Key Capabilities
- Data Preprocessing: Excels in error detection, data imputation, schema matching, and entity matching.
- Cost-Effective: At 13B parameters, it is small enough to run locally, which keeps sensitive data in-house rather than sending it to an external API.
- Dual Versions: Available in two distinct versions:
  - Jellyfish-13B (main branch): Designed for precise, straightforward answers, ideal for integration into data management systems where responses can be easily transformed into code.
  - Jellyfish-13B-Interpreter (alternative branch): Fine-tuned with reasoning steps distilled from GPT-4's sequential thought processes, making it more user-oriented and able to explain its data insights in depth.
- Strong NLP Performance: Maintains robust performance in general NLP tasks, as evidenced by benchmark comparisons.
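Tasks such as entity matching are posed to the model as natural-language prompts over serialized records. The sketch below shows one plausible way to build such a prompt; the instruction wording and the `build_entity_matching_prompt` helper are illustrative assumptions, not the model's official prompt template, so consult the model card for the exact format.

```python
# Sketch: framing entity matching as a prompt for Jellyfish-13B.
# The instruction wording here is an assumption, not the official template.

def build_entity_matching_prompt(record_a: dict, record_b: dict) -> str:
    """Serialize two records and ask whether they denote the same entity."""
    def fmt(record: dict) -> str:
        # Render a record as "key: value" pairs, e.g. "name: iPhone, brand: Apple".
        return ", ".join(f"{key}: {value}" for key, value in record.items())

    return (
        "You are tasked with determining whether two records refer to "
        "the same real-world entity.\n"
        f"Record A: [{fmt(record_a)}]\n"
        f"Record B: [{fmt(record_b)}]\n"
        "Are Record A and Record B the same entity? Answer Yes or No."
    )

prompt = build_entity_matching_prompt(
    {"name": "iPhone 13 Pro", "brand": "Apple"},
    {"name": "Apple iPhone 13 Pro", "brand": "Apple"},
)
```

Because the main-branch model answers with a terse Yes/No, the response maps directly onto a boolean inside a data-cleaning pipeline.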
Performance Highlights
Jellyfish-13B demonstrates strong performance across various data preprocessing tasks, often rivaling or surpassing larger models and specialized algorithms. For instance, it achieved 99.33% on Error Detection (Adult dataset) and 100% on Data Imputation (Buy dataset). On average, it achieved 86.02% across the seen data preprocessing tasks (those represented in its fine-tuning data), outperforming GPT-4's 84.17%.
Good for
- Data Management Systems: Jellyfish-13B's precise responses are well-suited for automated data cleaning and preparation pipelines.
- Data Analysts & Scientists: Jellyfish-13B-Interpreter provides detailed insights for users without advanced coding skills.
- Local Deployment: Its 13B size allows for efficient local execution, addressing data security and cost concerns.
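A minimal local-inference sketch using Hugging Face `transformers` is shown below. The `MODEL_ID` string and the `generate_locally` helper are assumptions for illustration; check the model's Hugging Face card for the exact repository id, and note that 13B-parameter weights require substantial GPU or CPU memory.

```python
# Sketch: running Jellyfish-13B locally with Hugging Face transformers.
# MODEL_ID is an assumed repository id -- verify it against the model card.
MODEL_ID = "NECOUDBFM/Jellyfish-13B"

def generate_locally(prompt: str, max_new_tokens: int = 64) -> str:
    """Run one prompt through the locally loaded model and return its answer.

    Imports are deferred so this module can be inspected even where
    transformers is not installed.
    """
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    # device_map="auto" spreads layers across available GPUs (or CPU).
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID, device_map="auto")

    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output = model.generate(**inputs, max_new_tokens=max_new_tokens)
    # Decode only the newly generated tokens, not the echoed prompt.
    return tokenizer.decode(
        output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
    )
```

Keeping both loading and generation behind one function keeps the example self-contained; a production pipeline would load the model once and reuse it across prompts.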
For more details, refer to the Jellyfish paper.