Sailor-7B: Open Language Model for South-East Asia
Sailor-7B is a 7.7-billion-parameter model from the Sailor suite, developed by sail and built on the Qwen 1.5 architecture with a 32,768-token context length. The model is optimized for South-East Asian (SEA) languages, including Indonesian, Thai, Vietnamese, Malay, and Lao, while maintaining proficiency in English and Chinese.
Key Capabilities
- Multilingual Proficiency: Understands and generates text in Indonesian, Thai, Vietnamese, Malay, and Lao, alongside English and Chinese.
- Strong Performance: Benchmarked for tasks such as question answering and commonsense reasoning in SEA languages.
- Robust Training: Continually pre-trained on 200 billion tokens from a high-quality, deduplicated corpus including SlimPajama, SkyPile, CC100, and MADLAD-400, with careful balancing of language weights.
- Instruction-Tuned Variants: Base models are further fine-tuned on open-source instruction datasets to produce the instruction-following Sailor-Chat versions.
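For the chat variants, conversations are typically rendered in ChatML format, the template used by the Qwen 1.5 family that Sailor builds on. The sketch below is an illustration of that format, assuming it carries over to Sailor-Chat; in practice you should rely on the tokenizer's own chat template rather than hand-building strings.

```python
def build_chat_prompt(messages):
    """Render a list of {"role", "content"} messages in ChatML format
    (assumed from Sailor's Qwen 1.5 lineage; illustrative only)."""
    parts = []
    for m in messages:
        parts.append(f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>")
    # Leave the prompt open for the assistant's reply.
    parts.append("<|im_start|>assistant\n")
    return "\n".join(parts)

# Example: a system message plus an Indonesian user request.
prompt = build_chat_prompt([
    {"role": "system", "content": "You are a helpful assistant fluent in SEA languages."},
    {"role": "user", "content": "Terjemahkan ke bahasa Inggris: Selamat pagi."},
])
print(prompt)
```

With `transformers`, the equivalent is `tokenizer.apply_chat_template(messages, add_generation_prompt=True)`, which uses the template shipped with the model and is the safer choice.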
Good For
- Applications requiring high-quality language understanding and generation in Indonesian, Thai, Vietnamese, Malay, and Lao.
- Research and commercial use under the Apache 2.0 License, with specific considerations for Qwen's license for large-scale commercial deployments (over 100 million monthly active users).
- Developers looking for a model with a 32,768-token context length optimized for the linguistic nuances of the SEA region.
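As a Qwen-based model on the Hugging Face Hub, Sailor-7B can be loaded through the standard `transformers` path. The sketch below assumes the repo id `sail/Sailor-7B` and illustrative sampling defaults; it is not an official usage recipe, and the weights (several gigabytes) are only fetched when `load_sailor()` is actually called.

```python
MODEL_ID = "sail/Sailor-7B"  # assumed Hugging Face repo id
MAX_CONTEXT = 32768          # context length stated in the model card

def generation_config(max_new_tokens=256, temperature=0.7):
    """Illustrative sampling settings for open-ended SEA-language completion."""
    return {
        "max_new_tokens": max_new_tokens,
        "temperature": temperature,
        "do_sample": temperature > 0,  # greedy decoding when temperature is 0
        "top_p": 0.9,
    }

def load_sailor(model_id=MODEL_ID):
    """Load tokenizer and model; requires `transformers` and `torch`,
    and downloads the full weights on first use."""
    from transformers import AutoModelForCausalLM, AutoTokenizer
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")
    return tokenizer, model

if __name__ == "__main__":
    tokenizer, model = load_sailor()
    prompt = "Ibu kota Indonesia adalah"  # Indonesian: "The capital of Indonesia is"
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output = model.generate(**inputs, **generation_config())
    print(tokenizer.decode(output[0], skip_special_tokens=True))
```

Base-model completion prompts like the one above work for all five SEA languages; for the Sailor-Chat variants, use the tokenizer's chat template instead of raw completion prompts.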