ThaiLLM/ThaiLLM-8B
ThaiLLM/ThaiLLM-8B is a continued pre-trained base model derived from Qwen3-8B-Base, trained on a diverse corpus of approximately 63 billion tokens. It is optimized for Thai language understanding, showing significant gains over its base model on Thai-specific benchmarks such as ThaiExam and Belebele-Thai. The model serves as a strong foundation for instruction fine-tuning to specific user requirements and use cases, particularly applications that demand robust Thai language processing.
ThaiLLM-8B: A Qwen3-8B-Base Continued Pre-trained Model for Thai Language
ThaiLLM-8B is a base language model obtained by continued pre-training of Qwen3-8B-Base. It was trained on an extensive and diverse corpus totaling approximately 63 billion tokens, with a significant portion dedicated to Thai language data.
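As a quick sanity check, the model can be loaded with the Hugging Face `transformers` library like any other causal LM. The sketch below is illustrative: the model ID comes from this card, but the dtype, sampling parameters, and Thai prompt are assumptions rather than values specified by the project.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "ThaiLLM/ThaiLLM-8B"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # assumption: bf16 weights; fall back to float16 if needed
    device_map="auto",
)

# Base models do free-form text completion, not chat:
# give a plain Thai prompt for the model to continue.
prompt = "ประเทศไทยมีจังหวัดทั้งหมด"  # "Thailand has a total of ... provinces"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=64,
    do_sample=True,   # illustrative sampling settings, not project defaults
    temperature=0.7,
    top_p=0.9,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

Because this is a base model with no chat template applied, prompts should be written as text to be continued rather than as instructions.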
Key Characteristics & Capabilities
- Enhanced Thai Language Understanding: Demonstrates substantial gains over Qwen3-8B-Base on Thai-specific Natural Language Understanding (NLU) benchmarks: ThaiExam (+0.16896 average), Belebele-Thai (+0.4567), and MMLU-TH (+0.0200).
- Diverse Training Data: The training corpus includes 24 billion English tokens (Fineweb2-ENG), over 31.5 billion Thai tokens (Fineweb2-TH), and 8 billion tokens from a curated dataset covering categories like Business & Finance, News, Education, Research Articles, and Law.
- Base Model for Fine-tuning: Designed as a foundation model; instruction fine-tuning is required to align it with specific user requirements and use cases. Several instruction-tuned models, such as Typhoon, THaLLE, OpenThaiGPT, and Pathumma, have been built on ThaiLLM-8B.
Usage and Recommendations
- Instruction Fine-tuning: Instruction fine-tuning ThaiLLM-8B for specific tasks is strongly recommended. The README suggests using LLaMA-Factory for this purpose and provides a quick-start guide (see the config sketch after this list).
- Limitations: As a base model, it may not perform well out of the box on all tasks. Specialized domains may require domain-specific fine-tuning, and outputs should always be verified for accuracy in critical applications.
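As a minimal sketch of that fine-tuning workflow, the LoRA SFT config below follows the structure of LLaMA-Factory's published example configs. The dataset name, template choice, and hyperparameters are placeholders and assumptions, not values from the project's quick-start guide.

```yaml
### model
model_name_or_path: ThaiLLM/ThaiLLM-8B

### method
stage: sft
do_train: true
finetuning_type: lora
lora_target: all

### dataset
dataset: alpaca_en_demo   # placeholder: substitute your own Thai instruction dataset
template: qwen            # assumption: Qwen-style template, since the base is Qwen3-8B-Base
cutoff_len: 2048

### output
output_dir: saves/thaillm-8b-lora-sft

### train
per_device_train_batch_size: 1
gradient_accumulation_steps: 8
learning_rate: 1.0e-4
num_train_epochs: 3.0
lr_scheduler_type: cosine
bf16: true
```

With LLaMA-Factory installed, a config like this would typically be launched with `llamafactory-cli train thaillm_lora_sft.yaml`; consult the project's quick-start guide for the exact recommended settings.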