TigerCoder-1B: Bangla Code Generation LLM
TigerCoder-1B, developed by Nishat Raihan, Antonios Anastasopoulos, and Marcos Zampieri at George Mason University, is a 1-billion-parameter instruction-tuned causal language model. It belongs to the first dedicated family of Code LLMs for Bangla, a language with a large native-speaker base that remains severely underrepresented in code generation. The model targets the performance drop that general-purpose LLMs exhibit on Bangla coding prompts.
Key Capabilities & Features
- Bangla Code Generation: Specialized for generating code from Bangla instructions across multiple programming languages.
- Superior Performance: Despite its compact 1B parameter size, TigerCoder-1B surpasses models up to 27x larger (including Gemma-3 27B) by 4-8 percentage points on Bangla code generation benchmarks.
- Multilingual Code Support: Achieves strong Pass@1 scores on Bangla prompts for Python (0.69), C++ (0.64), Java (0.58), and JavaScript (0.53) on mHumanEval.
- Dedicated Training Data: Fine-tuned on 300K Bangla instruction-code pairs from the custom-created Bangla-Code-Instruct dataset, comprising Self-Instruct, Synthetic, and Translated+Filtered subsets.
- MBPP-Bangla Benchmark: Introduces a 974-problem benchmark of expert-validated Bangla programming tasks spanning 5 programming languages.
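The Pass@1 figures above follow the standard unbiased pass@k estimator from the HumanEval evaluation protocol: given n generated samples per problem of which c pass the tests, pass@k = 1 - C(n-c, k) / C(n, k). A minimal sketch (not the authors' exact evaluation code):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: probability that at least one of k
    samples drawn from n generations is correct, given c correct."""
    if n - c < k:
        # Fewer incorrect samples than k: every draw of k contains a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# With n=10 samples and c=7 passing, pass@1 reduces to c/n:
print(pass_at_k(10, 7, 1))  # 0.7
```

Note that for k=1 the estimator collapses to the simple fraction c/n, which is why Pass@1 can be read directly as an average per-problem success rate.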
Why TigerCoder-1B is Different
TigerCoder-1B demonstrates that high-quality, targeted, domain-specific data can outweigh model scale for low-resource code generation. It also shows that machine-translating coding prompts from Bangla to English does not improve performance, because code-specific keywords are frequently mistranslated. The model is optimized primarily for Bangla code generation; its performance on general NLU or non-code tasks may not match that of general-purpose models.
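Benchmarks in the MBPP family score generations by functional correctness: the candidate code is executed and then checked against assert-based test cases. The authors' exact harness is not reproduced here; the sketch below assumes MBPP-style assert statements, and `check_solution` is a hypothetical helper name:

```python
def check_solution(code: str, tests: list[str]) -> bool:
    """MBPP-style functional check: execute the candidate solution,
    then run each assert-based test case against the resulting namespace."""
    env: dict = {}
    try:
        exec(code, env)      # define the candidate function(s)
        for t in tests:
            exec(t, env)     # each test is a single `assert ...` line
    except Exception:
        return False         # syntax error, runtime error, or failed assert
    return True

solution = "def add(a, b):\n    return a + b"
print(check_solution(solution, ["assert add(2, 3) == 5"]))  # True
```

A production harness would additionally sandbox execution and enforce timeouts, since model-generated code is untrusted.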