Overview
Sailor-7B: Open Language Model for South-East Asia
Sailor-7B is a 7.7-billion-parameter model from the Sailor suite, developed by Sail (Sea AI Lab) and based on the Qwen 1.5 architecture. The model is optimized for South-East Asian (SEA) languages, including Indonesian, Thai, Vietnamese, Malay, and Lao, while maintaining proficiency in English and Chinese.
Key Capabilities
- Multilingual Proficiency: Designed to understand and generate text across diverse SEA linguistic landscapes.
- Strong Performance: Benchmarked for tasks such as question answering and commonsense reasoning in SEA languages.
- Robust Training: Continually pre-trained on 200 billion tokens from a high-quality, deduplicated corpus (SlimPajama, SkyPile, CC100, and MADLAD-400), with language weights balanced across sources.
- Instruction-Tuned Variants: Base models are further fine-tuned on open-source instruction datasets to create the instruction-following Sailor-Chat versions.
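As a quick illustration of driving a chat variant, the sketch below renders a multi-turn conversation as a ChatML-style prompt. This assumes the Sailor-Chat models follow the ChatML convention inherited from Qwen 1.5; the `build_chatml_prompt` helper is hypothetical, so check the model card's actual chat template before relying on this layout.

```python
# Hypothetical sketch: formatting a multi-turn prompt for a Sailor-Chat model.
# Assumes the ChatML convention inherited from Qwen 1.5; verify against the
# model's own chat template before use.

def build_chatml_prompt(messages):
    """Render a list of {"role", "content"} dicts as a ChatML-style prompt."""
    parts = []
    for msg in messages:
        parts.append(f"<|im_start|>{msg['role']}\n{msg['content']}<|im_end|>")
    # Leave the assistant turn open so the model generates the reply.
    parts.append("<|im_start|>assistant\n")
    return "\n".join(parts)

prompt = build_chatml_prompt([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Apa ibu kota Indonesia?"},
])
```

In practice, `tokenizer.apply_chat_template` from the Hugging Face `transformers` library applies the template shipped with the model and is the safer choice; the manual version above only shows what that template produces.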
Good For
- Applications requiring high-quality language understanding and generation in Indonesian, Thai, Vietnamese, Malay, and Lao.
- Research and commercial use under the Apache 2.0 License; large-scale commercial deployments (over 100 million monthly active users) must also observe the terms of Qwen's license.
- Developers looking for a model with a 32,768-token context length optimized for the linguistic nuances of the SEA region.
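For long inputs, it can help to estimate whether a prompt plus a generation budget fits the 32,768-token window before calling the model. The sketch below uses a crude characters-per-token heuristic as a placeholder; the ratio is an assumption, not the model's tokenizer, so use the Hugging Face tokenizer for exact counts.

```python
# Rough context-budget check against the 32,768-token window.
CONTEXT_LENGTH = 32_768
CHARS_PER_TOKEN = 4  # assumed heuristic, not the real tokenizer's ratio

def fits_in_context(prompt: str, reserved_for_output: int = 1024) -> bool:
    """Estimate whether the prompt plus a generation budget fit the window."""
    estimated_tokens = len(prompt) // CHARS_PER_TOKEN + 1
    return estimated_tokens + reserved_for_output <= CONTEXT_LENGTH
```

SEA scripts such as Thai or Lao can tokenize at very different densities than English, so the exact count from the model's tokenizer matters more here than for Latin-script text.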