Sailor-1.8B-Chat is a 1.8 billion parameter instruction-tuned causal language model developed by sail, built upon the Qwen 1.5 architecture. It is specifically tailored for South-East Asian languages including Indonesian, Thai, Vietnamese, Malay, and Lao, while maintaining proficiency in English and Chinese. This model excels in tasks such as question answering and commonsense reasoning across these diverse linguistic landscapes.
Overview
Sailor-1.8B-Chat is a 1.8 billion parameter instruction-tuned language model developed by sail, designed with a strong focus on South-East Asian (SEA) languages. Built upon the Qwen 1.5 architecture, this model is part of a suite of Sailor models ranging from 0.5B to 14B parameters. It has been continually pre-trained and fine-tuned using a combination of public datasets, with careful data curation and deduplication to enhance performance in SEA languages.
Key Capabilities
- Multilingual Proficiency: Optimized for Indonesian, Thai, Vietnamese, Malay, and Lao, while retaining strong performance in English and Chinese.
- Instruction Following: Fine-tuned with open-source instruction datasets like aya_collection, aya_dataset, and OpenOrca for chat-based applications.
- Reasoning and QA: Benchmarking results indicate proficiency in tasks such as question answering and commonsense reasoning.
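For chat use, prompts are typically rendered with the model's chat template. The sketch below builds a ChatML-style prompt by hand purely for illustration; it assumes Sailor-1.8B-Chat inherits Qwen 1.5's ChatML format (`<|im_start|>` / `<|im_end|>` markers), which should be verified against the template shipped with the model's tokenizer. In practice, prefer `tokenizer.apply_chat_template` from the transformers library, which applies the authoritative template.

```python
# Minimal sketch of a ChatML-style prompt builder. ASSUMPTION: the model
# uses Qwen 1.5's ChatML format; the real template is defined by the
# tokenizer config, so use tokenizer.apply_chat_template in real code.
def build_chatml_prompt(messages):
    """messages: list of {'role': ..., 'content': ...} dicts."""
    parts = []
    for m in messages:
        parts.append(f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n")
    # Open an assistant turn so the model generates the reply.
    parts.append("<|im_start|>assistant\n")
    return "".join(parts)

prompt = build_chatml_prompt([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Apa ibu kota Indonesia?"},  # Indonesian example
])
print(prompt)
```

The resulting string is then tokenized and passed to the model's `generate` method; decoding stops at the `<|im_end|>` marker.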
Training Details
The model underwent continual pre-training from Qwen 1.5, leveraging large public corpora including SlimPajama, SkyPile, CC100, and MADLAD-400. Training consumed 200 to 400 billion tokens, with systematic experiments to balance the data-mixture weights across languages, yielding robust performance in the target SEA languages without significant compromise on English and Chinese capabilities. A technical report is available for more details: arxiv.org/pdf/2404.03608.pdf.
Good For
- Applications requiring strong language understanding and generation in South-East Asian languages.
- Chatbots and conversational AI systems targeting users in Indonesia, Thailand, Vietnam, Malaysia, and Laos.
- Research and development in multilingual NLP, particularly for low-resource SEA languages.