Overview
Sailor-4B is a 4-billion-parameter language model from the Sailor suite, released by the sail team and built on the Qwen1.5 architecture. It is continually pre-trained to excel in South-East Asian (SEA) languages, including Indonesian, Thai, Vietnamese, Malay, and Lao, while retaining proficiency in English and Chinese. Its SEA-language gains come from continued pre-training on a corpus prepared with aggressive deduplication and careful data cleaning.
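Since Sailor-4B is a base (non-chat) model, plain text completion is the natural way to try it. Below is a minimal inference sketch, assuming the checkpoint is published on Hugging Face under the id "sail/Sailor-4B" and that the transformers (and accelerate, for device_map) libraries are installed; it is not the project's official example.

```python
# Minimal completion sketch for the base model.
# Assumptions: repo id "sail/Sailor-4B", bf16 weights fit on the available device.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "sail/Sailor-4B"  # assumed Hugging Face repo id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Base models continue a prompt rather than answer instructions;
# here the prompt is Indonesian ("The capital of Indonesia is").
prompt = "Ibu kota Indonesia adalah"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```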
Key Capabilities
- Multilingual Proficiency: Optimized for Indonesian, Thai, Vietnamese, Malay, and Lao, alongside English and Chinese.
- Robust Training: Continually pre-trained on 200 billion tokens from diverse public datasets like SlimPajama, SkyPile, CC100, and MADLAD-400.
- Task Performance: Demonstrates strong performance in question answering and commonsense reasoning tasks within SEA languages.
- Context Length: Supports a 32,768-token context window.
Good for
- Applications requiring high-quality text understanding and generation in South-East Asian languages.
- Developers building multilingual applications targeting the SEA region.
- Research and development in low-resource language processing, particularly for the specified SEA languages.
Technical Details
Sailor models come in sizes from 0.5B to 14B parameters, with instruction-tuned 'Sailor-Chat' variants also released. The project provides a technical report detailing its development and benchmark results.
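For the instruction-tuned variants, a chat-style interface is the natural fit. The sketch below assumes a repo id of "sail/Sailor-4B-Chat" and that its tokenizer ships a chat template; check the variant's model card for the exact prompt format before relying on this.

```python
# Hypothetical sketch for a Sailor-Chat variant.
# Assumptions: repo id "sail/Sailor-4B-Chat" exists and defines a chat template.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

chat_id = "sail/Sailor-4B-Chat"  # assumed repo id
tokenizer = AutoTokenizer.from_pretrained(chat_id)
model = AutoModelForCausalLM.from_pretrained(
    chat_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# Indonesian instruction: "Translate into Thai: Good morning!"
messages = [{"role": "user", "content": "Terjemahkan ke bahasa Thai: Selamat pagi!"}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(input_ids, max_new_tokens=128)
# Decode only the newly generated tokens, skipping the formatted prompt.
print(tokenizer.decode(outputs[0][input_ids.shape[-1]:], skip_special_tokens=True))
```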