hengranZhang/BOOM_4B_eng_data_v1
Text generation · Concurrency cost: 1 · Model size: 4B · Quant: BF16 · Context length: 32k · Published: Feb 5, 2026 · Architecture: Transformer

The BOOM_4B_eng_data_v1 model by hengranZhang is a 4 billion parameter text embedding model designed for robust generalization and efficient incremental learning. It uses a Bagging-based Robust Model Merging (BOOM) strategy: multiple embedding models are trained on sampled subsets of the data and then merged, improving out-of-domain performance and reducing retraining costs. This version is trained exclusively on English text and targets a range of NLP tasks, including retrieval, reranking, classification, clustering, and semantic textual similarity.


BOOM_4B_eng_data_v1: Robust Text Embeddings via Bagging-Based Model Merging

This model, developed by hengranZhang, is a 4 billion parameter text embedding model built upon the novel Bagging-based rObust mOdel Merging (BOOM) strategy. BOOM addresses limitations of traditional multi-task training, such as suboptimal out-of-domain (OOD) generalization and expensive full retraining for incremental learning.

Key Capabilities & Innovations

  • Robust Generalization: BOOM trains multiple embedding models on sampled data subsets and merges them into a single, more robust model, consistently improving both in-domain and OOD performance.
  • Efficient Incremental Learning: It supports efficient updates by training lightweight models on new data and merging them into the existing model, significantly reducing training costs compared to full retraining.
  • Multi-SLERP Merging: The model was merged using the Multi-SLERP method, combining several individual models (Data_mixing_sampled80_full, Data_mixingsampled100_full, etc.) with specific weights.
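The model card does not spell out how Multi-SLERP combines the checkpoints, but the general idea can be sketched: spherically interpolate (SLERP) between weight vectors rather than averaging them linearly, folding the bagged models together one at a time. The merge coefficients and the pairwise folding order below are illustrative assumptions, not the card's actual recipe, and random vectors stand in for real checkpoints.

```python
import numpy as np

def slerp(w0, w1, t):
    """Spherical linear interpolation between two flattened weight vectors."""
    w0n = w0 / np.linalg.norm(w0)
    w1n = w1 / np.linalg.norm(w1)
    dot = np.clip(np.dot(w0n, w1n), -1.0, 1.0)
    theta = np.arccos(dot)          # angle between the two weight directions
    if np.isclose(theta, 0.0):      # nearly parallel: fall back to plain lerp
        return (1 - t) * w0 + t * w1
    s = np.sin(theta)
    return (np.sin((1 - t) * theta) / s) * w0 + (np.sin(t * theta) / s) * w1

# Bagging-style merge: K models trained on sampled subsets, represented here
# by random vectors in place of real 4B-parameter checkpoints.
rng = np.random.default_rng(0)
models = [rng.normal(size=1024) for _ in range(4)]
coeffs = [0.25, 0.25, 0.25, 0.25]   # assumed merge weights, not from the card

# Fold the models together pairwise, interpolating toward each new model by
# its share of the accumulated weight so the final blend respects `coeffs`.
merged = models[0]
acc = coeffs[0]
for w, c in zip(models[1:], coeffs[1:]):
    acc += c
    merged = slerp(merged, w, c / acc)
```

The same folding step is what makes incremental learning cheap: a lightweight model trained only on new data can be SLERP-merged into the existing checkpoint instead of retraining from scratch.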

Training Data & Focus

This specific version, BOOM_4B_eng_data_v1, is trained exclusively on a diverse English text dataset of approximately 2 million entries. This dataset covers a wide range of NLP tasks, including:

  • Retrieval: ELI5, HotpotQA, MSMARCO, NQ, SQuAD, TriviaQA, FiQA.
  • Reranking: StackOverflowDupQuestions.
  • Classification: AmazonReviews, Banking77, Emotion, MTOPIntent, IMDB, ToxicConversations, TweetSentimentExtraction, AmazonCounterfactual.
  • Clustering: Arxiv/Biorxiv/Medrxiv/Reddit/StackExchangeClustering, TwentyNewsgroups.
  • Semantic Textual Similarity (STS): STS12, STS22, STSB.
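All of the task families above reduce to comparing vectors in a single embedding space, typically by cosine similarity. The sketch below uses small random vectors in place of the model's actual outputs (the dimension and scoring convention are assumptions) to show how one query embedding ranks document embeddings for retrieval; the same dot product scores sentence pairs for STS.

```python
import numpy as np

def normalize(x):
    """L2-normalize rows so a dot product equals cosine similarity."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

# Stand-ins for embeddings the model would produce; dimension is illustrative.
rng = np.random.default_rng(1)
query = normalize(rng.normal(size=(1, 64)))   # one query embedding
docs = normalize(rng.normal(size=(5, 64)))    # five document embeddings

scores = (query @ docs.T).ravel()   # cosine similarity, one score per document
ranking = np.argsort(-scores)       # indices of documents, best match first
```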