Overview
TigerLLM-1B-it: A Specialized Bangla Language Model
TigerLLM-1B-it is a 1 billion parameter instruction-tuned model from the TigerLLM family, developed by Nishat Raihan and Marcos Zampieri at George Mason University. This model is specifically engineered to address the significant gap in high-quality, reproducible Large Language Models for Bangla, the world's 5th most spoken language. It is built upon a LLaMA-3.2 base and has been continually pre-trained and fine-tuned using unique, high-quality Bangla datasets.
Key Capabilities & Differentiators
- Bangla-Centric Design: Developed from the ground up for the Bangla language, utilizing a 10M-token "Bangla-TextBook" corpus derived from authentic educational materials (Grades 6-12).
- High-Quality Instruction Following: Fine-tuned on "Bangla-Instruct," a 100K instruction-response dataset generated via a self-instruct framework with GPT-4 and Claude-3.5-Sonnet, ensuring native linguistic quality and cultural sensitivity.
- Superior Performance: Benchmarked against multiple Bangla-specific evaluations (MMLU-bn, PangBench-bn, BanglaQuaD, mHumanEval-bn, BEnQA, BanglaRQA), TigerLLM-1B-it demonstrates performance that surpasses existing open-source Bangla LLMs and even larger proprietary models like GPT-3.5 in several metrics.
- Reproducibility: Emphasizes transparent methodology and dataset creation to ensure reproducibility, a common limitation in previous Bangla LLM initiatives.
Good For
- Bangla NLP Applications: Ideal for developers and researchers building applications requiring robust understanding and generation in Bangla.
- Educational Tools: Particularly well-suited for tasks related to educational content, given its training on the Bangla-TextBook corpus.
- Research in Low-Resource Languages: Serves as a strong baseline for further research and development in Bangla and other low-resource language LLMs.