WangchanLION-v3: A Thai-Focused Multilingual LLM
WangchanLION-v3 is a collaborative effort by AI Singapore and VISTEC to advance Large Language Models (LLMs) for Southeast Asian languages, with a strong emphasis on Thai. This 8-billion-parameter decoder-only model is built on the Llama 3.1 architecture and has undergone extensive continual pre-training on approximately 47.4 billion tokens of Thai text drawn from web and curated datasets.
Key Capabilities & Features
- Thai Language Specialization: Pre-trained and instruction-tuned primarily for Thai, making it highly proficient in Thai-specific contexts.
- Multilingual Support: Supports both English and Thai languages.
- Extended Context Length: Supports a 128k-token context window, allowing much longer inputs to be processed.
- Llama 3.1 Architecture: Builds on the Llama3.1 8B CPT SEA-LIONv3 instruction base and its default tokenizer (a loading and inference sketch follows this list).
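As a quick orientation, the sketch below loads the model with Hugging Face transformers and generates a short Thai completion. The repository identifier `airesearch/WangchanLION-v3`, the prompt, and the generation settings are illustrative assumptions, not values taken from the official release; substitute the model's actual Hugging Face path before running.

```python
# Minimal inference sketch. The repository id below is a placeholder --
# replace it with the model's actual Hugging Face Hub path.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "airesearch/WangchanLION-v3"  # hypothetical identifier

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # 8B parameters fit on a single ~24 GB GPU in bf16
    device_map="auto",
)

# Thai prompt: "Briefly explain what a large language model is."
prompt = "อธิบายสั้น ๆ ว่าโมเดลภาษาขนาดใหญ่คืออะไร"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=256,
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
)
# Print only the newly generated continuation, not the echoed prompt.
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
```

Because the context window extends to 128k tokens, long Thai documents can be passed directly in the prompt, though memory use grows with sequence length.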
When to Use This Model
- Thai Language Applications: Ideal for use cases requiring deep understanding and generation in Thai.
- Supervised Fine-Tuning (SFT): Best suited as a starting point for supervised fine-tuning, since the model has not been trained for system prompts or tool calling (see the fine-tuning sketch after this list).
- Research and Development: A valuable resource for researchers and developers focusing on Southeast Asian language models.
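Because the model ships without system-prompt or tool-calling training, a common path is to adapt it to a downstream task with supervised fine-tuning. The sketch below uses the TRL `SFTTrainer` on a toy Thai instruction set; the model identifier, dataset, output path, and hyperparameters are illustrative assumptions, not the WangchanLION-v3 training recipe.

```python
# Supervised fine-tuning sketch with TRL (illustrative only; model id, data,
# and hyperparameters are placeholders, not the official recipe).
from datasets import Dataset
from trl import SFTConfig, SFTTrainer

model_id = "airesearch/WangchanLION-v3"  # hypothetical identifier

# Tiny toy dataset in plain prompt -> response form; a real run would use a
# full Thai instruction corpus instead.
train_data = Dataset.from_list([
    {"text": "คำถาม: เมืองหลวงของประเทศไทยคือที่ใด\nคำตอบ: กรุงเทพมหานคร"},
    {"text": "คำถาม: 2 บวก 2 เท่ากับเท่าไร\nคำตอบ: 4"},
])

config = SFTConfig(
    output_dir="wangchanlion-v3-sft",  # hypothetical output path
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    num_train_epochs=1,
    learning_rate=2e-5,
    bf16=True,
    logging_steps=10,
)

trainer = SFTTrainer(
    model=model_id,           # SFTTrainer loads the model and tokenizer from this id
    train_dataset=train_data,
    args=config,
)
trainer.train()
```

In practice you would train on a much larger corpus and likely add parameter-efficient options such as LoRA via peft; the point here is only the overall shape of an SFT run.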
Important Considerations
Users should be aware that WangchanLION-v3, like many LLMs, may hallucinate or occasionally generate irrelevant content. The model has not been aligned for safety, so developers are advised to perform their own safety fine-tuning and implement related security measures.