hengranZhang/BOOM_4B_eng_data_v1
Text generation · Concurrency cost: 1 · Model size: 4B · Quant: BF16 · Context length: 32k · Published: Feb 5, 2026 · Architecture: Transformer

The BOOM_4B_eng_data_v1 model by hengranZhang is a 4 billion parameter text embedding model designed for robust generalization and efficient incremental learning. It uses a Bagging-based Robust Model Merging (BOOM) strategy: multiple embedding models are trained on sampled subsets of the data and then merged, improving out-of-domain performance and reducing retraining costs. This version is trained exclusively on English text and targets a range of NLP tasks, including retrieval, reranking, classification, clustering, and semantic textual similarity.


BOOM_4B_eng_data_v1: Robust Text Embeddings via Bagging-Based Model Merging

This model, developed by hengranZhang, is a 4 billion parameter text embedding model built upon the novel Bagging-based rObust mOdel Merging (BOOM) strategy. BOOM addresses limitations of traditional multi-task training, such as suboptimal out-of-domain (OOD) generalization and expensive full retraining for incremental learning.

Key Capabilities & Innovations

  • Robust Generalization: BOOM trains multiple embedding models on sampled data subsets and merges them into a single, more robust model, consistently improving both in-domain and OOD performance.
  • Efficient Incremental Learning: It supports efficient updates by training lightweight models on new data and merging them into the existing model, significantly reducing training costs compared to full retraining.
  • Multi-SLERP Merging: The model was merged using the Multi-SLERP method, combining several individual models (Data_mixing_sampled80_full, Data_mixingsampled100_full, etc.) with specific weights.
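The model card does not spell out how Multi-SLERP combines the checkpoints, but the general idea can be sketched: spherically interpolate (SLERP) between weight vectors rather than averaging them linearly, folding the bagged models together one at a time. The merge coefficients and the pairwise folding order below are illustrative assumptions, not the card's actual recipe, and random vectors stand in for real checkpoints.

```python
import numpy as np

def slerp(w0, w1, t):
    """Spherical linear interpolation between two flattened weight vectors."""
    w0n = w0 / np.linalg.norm(w0)
    w1n = w1 / np.linalg.norm(w1)
    dot = np.clip(np.dot(w0n, w1n), -1.0, 1.0)
    theta = np.arccos(dot)          # angle between the two weight directions
    if np.isclose(theta, 0.0):      # nearly parallel: fall back to plain lerp
        return (1 - t) * w0 + t * w1
    s = np.sin(theta)
    return (np.sin((1 - t) * theta) / s) * w0 + (np.sin(t * theta) / s) * w1

# Bagging-style merge: K models trained on sampled subsets, represented here
# by random vectors in place of real 4B-parameter checkpoints.
rng = np.random.default_rng(0)
models = [rng.normal(size=1024) for _ in range(4)]
coeffs = [0.25, 0.25, 0.25, 0.25]   # assumed merge weights, not from the card

# Fold the models together pairwise, interpolating toward each new model by
# its share of the accumulated weight so the final blend respects `coeffs`.
merged = models[0]
acc = coeffs[0]
for w, c in zip(models[1:], coeffs[1:]):
    acc += c
    merged = slerp(merged, w, c / acc)
```

The same folding step is what makes incremental learning cheap: a lightweight model trained only on new data can be SLERP-merged into the existing checkpoint instead of retraining from scratch.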

Training Data & Focus

This specific version, BOOM_4B_eng_data_v1, is trained exclusively on a diverse English text dataset of approximately 2 million entries. This dataset covers a wide range of NLP tasks, including:

  • Retrieval: ELI5, HotpotQA, MSMARCO, NQ, SQuAD, TriviaQA, FiQA.
  • Reranking: StackOverflowDupQuestions.
  • Classification: AmazonReviews, Banking77, Emotion, MTOPIntent, IMDB, ToxicConversations, TweetSentimentExtraction, AmazonCounterfactual.
  • Clustering: Arxiv/Biorxiv/Medrxiv/Reddit/StackExchangeClustering, TwentyNewsgroups.
  • Semantic Textual Similarity (STS): STS12, STS22, STSB.
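All of the task families above reduce to comparing vectors in a single embedding space, typically by cosine similarity. The sketch below uses small random vectors in place of the model's actual outputs (the dimension and scoring convention are assumptions) to show how one query embedding ranks document embeddings for retrieval; the same dot product scores sentence pairs for STS.

```python
import numpy as np

def normalize(x):
    """L2-normalize rows so a dot product equals cosine similarity."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

# Stand-ins for embeddings the model would produce; dimension is illustrative.
rng = np.random.default_rng(1)
query = normalize(rng.normal(size=(1, 64)))   # one query embedding
docs = normalize(rng.normal(size=(5, 64)))    # five document embeddings

scores = (query @ docs.T).ravel()   # cosine similarity, one score per document
ranking = np.argsort(-scores)       # indices of documents, best match first
```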