TigerLLM-9B-it: A Leading Bangla LLM
TigerLLM-9B-it is a 9-billion-parameter instruction-tuned model in the TigerLLM family, developed by Nishat Raihan and Marcos Zampieri at George Mason University. It addresses the significant linguistic disparity in LLM development by focusing on Bangla, the fifth most spoken language in the world. The model is built on the Gemma-2 architecture, and the accompanying paper has been accepted to the ACL 2025 main conference.
Key Capabilities & Differentiators
- Superior Bangla Performance: Outperforms all open-source alternatives, as well as larger proprietary models such as GPT-3.5, on standard Bangla benchmarks (MMLU-bn, PangBench-bn, BanglaQuaD, mHumanEval-bn, BEnQA, BanglaRQA).
- High-Quality Training Data: Continually pretrained on the purpose-built Bangla-TextBook corpus (10M tokens drawn from 163 educational textbooks) and fine-tuned on the Bangla-Instruct dataset (100K native Bangla instruction-response pairs generated via self-instruct with advanced teacher models).
- Reproducible Methodology: Developed with reproducibility as an explicit design goal, addressing a common limitation of earlier Bangla LLM initiatives.
- Ethical Data Curation: Emphasizes ethical practices by using open-source educational materials and rigorous filtering for cultural sensitivity and bias avoidance.
Ideal Use Cases
- Bangla Language Applications: Well suited to any application requiring advanced natural language understanding and generation in Bangla.
- Educational Tools: Particularly strong for tasks related to academic content, given its training on the Bangla-TextBook corpus.
- Research and Development: Serves as a new baseline for future research in Bangla language modeling due to its strong performance and reproducible methodology.
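For the use cases above, the model can be queried like any other instruction-tuned causal LM. The sketch below is a minimal example using Hugging Face `transformers`; the Hub repository ID is an assumption (check the authors' release for the actual ID), and the prompt layout assumes the model inherits Gemma-2's chat template from its base architecture.

```python
"""Minimal sketch of querying TigerLLM-9B-it.

Assumptions (not confirmed by the model card):
- MODEL_ID is a placeholder, not the official repository ID.
- The fine-tune keeps Gemma-2's <start_of_turn>/<end_of_turn> chat layout.
"""

MODEL_ID = "tigerllm/TigerLLM-9B-it"  # hypothetical Hub repository ID


def build_gemma_prompt(user_message: str) -> str:
    """Format a single-turn prompt in the Gemma-2 chat layout."""
    return (
        "<start_of_turn>user\n"
        f"{user_message}<end_of_turn>\n"
        "<start_of_turn>model\n"
    )


def generate(user_message: str, max_new_tokens: int = 256) -> str:
    """Load the model lazily (the weights are large) and generate a reply."""
    # Imported here so the prompt helper works without transformers installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID, device_map="auto")
    inputs = tokenizer(build_gemma_prompt(user_message), return_tensors="pt")
    inputs = inputs.to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=max_new_tokens)
    # Strip the prompt tokens, keep only the newly generated reply.
    reply_ids = outputs[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(reply_ids, skip_special_tokens=True)


# Example (requires downloading the weights):
# print(generate("বাংলাদেশের রাজধানীর নাম কী?"))  # "What is the capital of Bangladesh?"
```

In practice you would prefer `tokenizer.apply_chat_template(...)` over a hand-rolled prompt builder, since it picks up whatever template the released tokenizer actually ships with.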