ThaiLLM-8B: A Thai-Optimized Base Language Model
ThaiLLM-8B is a base language model developed by ThaiLLM, built on the Qwen3-8B-Base architecture. It underwent continued pre-training on a diverse corpus of approximately 63 billion tokens, with a strong emphasis on Thai language data (over 31 billion Thai tokens drawn from Fineweb2-TH and additional curated datasets).
Key Capabilities & Differentiators
- Enhanced Thai Language Understanding: ThaiLLM-8B significantly outperforms its Qwen3-8B-Base counterpart on various Thai-specific natural language understanding (NLU) benchmarks, including ThaiExam, Belebele-Thai, and MMLU-TH. Notable gains include +0.36 on A-Level and +0.29 on TGAT within the ThaiExam suite.
- Diverse Training Data: The model's training corpus includes a rich mix of English (Fineweb2-ENG) and Thai datasets, along with a curated dataset covering categories like Business & Finance, News, Education, and Research Articles.
- Foundation for Instruction Tuning: This is a base model, intended to be instruction fine-tuned for specific use cases rather than used directly as an assistant. Several instruction-tuned models, such as Typhoon by SCB10X and THaLLE by KBTG, have already been built upon ThaiLLM-8B.
When to Use This Model
- Developing Thai-centric LLM Applications: Ideal as a starting point for building applications that require strong performance in the Thai language.
- Instruction Fine-tuning: Recommended for users who plan to instruction fine-tune a model for specific tasks, leveraging its robust pre-trained Thai understanding.
- Research and Development: Suitable for researchers exploring continued pre-training techniques and multilingual model adaptation, particularly for low-resource languages like Thai.
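Because ThaiLLM-8B is a base model rather than a chat model, it continues text instead of following instructions, so few-shot completion-style prompting works best before any fine-tuning. The sketch below illustrates this pattern with the Hugging Face `transformers` API; the model ID `ThaiLLM/ThaiLLM-8B` and the example prompts are illustrative assumptions, not confirmed by this card.

```python
# Minimal sketch of completion-style prompting for a base model.
# NOTE: the model ID "ThaiLLM/ThaiLLM-8B" is an assumed placeholder.

def build_fewshot_prompt(examples, query):
    """Format (question, answer) pairs plus a new question as one
    completion prompt, since a base model continues text rather than
    following an instruction."""
    parts = [f"Q: {q}\nA: {a}" for q, a in examples]
    parts.append(f"Q: {query}\nA:")
    return "\n\n".join(parts)

def generate(prompt, model_id="ThaiLLM/ThaiLLM-8B", max_new_tokens=64):
    # Requires `transformers` and `torch`; loads ~16 GB of bf16 weights.
    from transformers import AutoModelForCausalLM, AutoTokenizer
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output = model.generate(**inputs, max_new_tokens=max_new_tokens)
    # Decode only the newly generated tokens, not the prompt.
    new_tokens = output[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)

if __name__ == "__main__":
    prompt = build_fewshot_prompt(
        [("เมืองหลวงของประเทศไทยคืออะไร", "กรุงเทพมหานคร")],
        "เมืองหลวงของประเทศญี่ปุ่นคืออะไร",
    )
    print(generate(prompt))
```

Trimming the decoded output at the first blank line (the few-shot separator) is a common follow-up step, since a base model will often continue generating further Q/A pairs.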
Limitations
As a base model, ThaiLLM-8B requires instruction fine-tuning for optimal performance in most practical applications. Performance on highly specialized domains may necessitate further domain-specific fine-tuning.