Overview
Sailor-1.8B-Chat is a 1.8-billion-parameter instruction-tuned language model developed by sail, with a strong focus on South-East Asian (SEA) languages. Built on the Qwen1.5 architecture, it is part of a suite of Sailor models ranging from 0.5B to 14B parameters. The model was continually pre-trained and then fine-tuned on public datasets, with careful data curation and deduplication to strengthen performance in SEA languages.
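For concreteness, the sketch below shows one plausible way to load the model with the Hugging Face transformers library. The hub id sail/Sailor-1.8B-Chat follows from the developer and model name given above, while the dtype and device placement are illustrative assumptions based on common practice, not details stated in this card.

```python
# Minimal loading sketch (hub id inferred from this card; dtype and device
# placement are illustrative choices, not requirements).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "sail/Sailor-1.8B-Chat"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # fall back to float16/float32 if bf16 is unsupported
    device_map="auto",           # requires the `accelerate` package
)
```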
Key Capabilities
- Multilingual Proficiency: Optimized for Indonesian, Thai, Vietnamese, Malay, and Lao, while retaining strong performance in English and Chinese.
- Instruction Following: Fine-tuned on open-source instruction datasets such as aya_collection, aya_dataset, and OpenOrca for chat-based applications (see the usage sketch after this list).
- Reasoning and QA: Benchmark results indicate proficiency in tasks such as question answering and commonsense reasoning.
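Building on the loading sketch above, the following is a hedged single-turn chat example using the tokenizer's bundled chat template. The system/user role names and the Indonesian prompt are illustrative assumptions; the chat template shipped with the tokenizer is authoritative for the roles and format the model actually expects.

```python
# Single-turn chat sketch, reusing `tokenizer` and `model` from the loading example.
# Role names follow the common system/user convention; the tokenizer's bundled
# chat template defines the roles the model actually expects.
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Apa ibu kota Indonesia?"},  # Indonesian: "What is the capital of Indonesia?"
]

input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,  # append the assistant prefix so the model replies
    return_tensors="pt",
).to(model.device)

output_ids = model.generate(input_ids, max_new_tokens=128)

# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```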
Training Details
The model was continually pre-trained from Qwen1.5 on large public corpora, including SlimPajama, SkyPile, CC100, and MADLAD-400, covering roughly 200-400 billion tokens. Systematic experiments were run to balance language weights, ensuring robust performance across the target SEA languages without significantly compromising English and Chinese capabilities. A technical report with further details is available at arxiv.org/pdf/2404.03608.pdf.
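To illustrate what balancing language weights means in practice, the sketch below shows generic weighted sampling of training documents by language. The weights are hypothetical placeholders for exposition only, not Sailor's actual mixture, which was tuned experimentally and is documented in the technical report.

```python
import random

# Hypothetical per-language sampling weights; Sailor's real mixture was found
# through systematic experiments and is reported in the technical report.
language_weights = {
    "en": 0.30,
    "zh": 0.15,
    "id": 0.20,
    "th": 0.12,
    "vi": 0.12,
    "ms": 0.07,
    "lo": 0.04,
}

def sample_language(weights: dict[str, float]) -> str:
    """Pick the language of the next training document in proportion to its weight."""
    langs, probs = zip(*weights.items())
    return random.choices(langs, weights=probs, k=1)[0]
```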
Good For
- Applications requiring strong language understanding and generation in South-East Asian languages.
- Chatbots and conversational AI systems targeting users in Indonesia, Thailand, Vietnam, Malaysia, and Laos.
- Research and development in multilingual NLP, particularly for low-resource SEA languages.