Overview
Sailor-0.5B: A Language Model for South-East Asia
Sailor-0.5B is part of the Sailor suite of open language models, developed by sail for the linguistic diversity of South-East Asia (SEA). Built on the Qwen 1.5 architecture, the model has roughly 0.6 billion parameters (its name follows the 0.5B label of the Qwen 1.5 base model) and focuses on Indonesian, Thai, Vietnamese, Malay, and Lao.
Key Capabilities & Features
- Multilingual Proficiency: Optimized for understanding and generating text in key South-East Asian languages, while maintaining proficiency in English and Chinese.
- Strong Foundation: Continually pre-trained from Qwen 1.5 base models, which already perform well on South-East Asian (SEA) languages.
- High-Quality Training Data: Trained on a carefully curated and deduplicated corpus drawn from sources such as SlimPajama, SkyPile, CC100, and MADLAD-400.
- Instruction-Tuned Variants: The Sailor suite also includes instruction-tuned models (Sailor-Chat) for enhanced conversational capabilities.
- Benchmarked Performance: Demonstrates strong performance in tasks such as question answering and commonsense reasoning within SEA languages.
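The deduplicated corpus mentioned in the list above can be illustrated with a small sketch. The source does not describe how deduplication was actually performed, so this shows only the simplest variant, exact-match deduplication on normalized text. The helper names `normalize` and `deduplicate` are hypothetical and are not part of any Sailor tooling.

```python
import hashlib
import unicodedata


def normalize(text: str) -> str:
    """Apply Unicode NFKC normalization, lowercase, and collapse whitespace."""
    text = unicodedata.normalize("NFKC", text).lower()
    return " ".join(text.split())


def deduplicate(docs: list[str]) -> list[str]:
    """Keep the first occurrence of each document, compared by normalized hash.

    Illustrative only: this is exact-match deduplication, an assumption made
    for the example, not the pipeline used to build the Sailor corpus.
    """
    seen: set[str] = set()
    unique: list[str] = []
    for doc in docs:
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique
```

Exact matching misses near-duplicates such as pages that differ only in a footer; catching those would need a fuzzy method, which is outside the scope of this sketch.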
Training Details
Sailor-0.5B was continually pre-trained on 400 billion tokens, with the language weights of the training mixture chosen through systematic experiments. This process improves performance on SEA languages without significantly compromising the model's capabilities in English and Chinese.
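To make the idea of language weights concrete, here is a minimal sketch of how a set of weights could be turned into per-language token budgets and used to sample training languages. The function names and the example weights are hypothetical and chosen for illustration; the source does not state the weights that were actually used.

```python
import random


def normalize_weights(weights: dict[str, float]) -> dict[str, float]:
    """Scale raw language weights so that they sum to 1."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return {lang: w / total for lang, w in weights.items()}


def token_budget(weights: dict[str, float], total_tokens: int) -> dict[str, int]:
    """Split a total token count across languages in proportion to the weights."""
    probs = normalize_weights(weights)
    return {lang: round(p * total_tokens) for lang, p in probs.items()}


def sample_languages(weights: dict[str, float], n: int, seed: int = 0) -> list[str]:
    """Draw n language labels, each chosen with probability equal to its weight."""
    probs = normalize_weights(weights)
    rng = random.Random(seed)  # seeded so that the draw is reproducible
    return rng.choices(list(probs), weights=list(probs.values()), k=n)
```

With hypothetical weights such as `{"id": 3, "th": 1}`, `token_budget` assigns three quarters of the budget to Indonesian and one quarter to Thai. Searching for good weights then amounts to repeating small training runs with different dictionaries and comparing results.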
Use Cases
- Applications requiring robust language understanding and generation in Indonesian, Thai, Vietnamese, Malay, and Lao.
- Research and development in low-resource or underrepresented South-East Asian languages.
- Building chatbots, translation tools, or content generation systems tailored for the SEA region.
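For the use cases above, a minimal text-generation sketch with the Hugging Face `transformers` library might look like the following. It assumes the model is published under the repository id `sail/Sailor-0.5B` and that `transformers` and PyTorch are installed; both are assumptions to verify before use, and the `generate` helper is a hypothetical wrapper, not an official API.

```python
# Assumed Hugging Face repository id; verify it before use.
MODEL_ID = "sail/Sailor-0.5B"


def generate(prompt: str, max_new_tokens: int = 64, model_id: str = MODEL_ID) -> str:
    """Return the model's continuation of `prompt` as plain text."""
    # Imported lazily so this module can be imported without transformers installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id)

    inputs = tokenizer(prompt, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)

    # Drop the prompt tokens so that only newly generated text is returned.
    new_tokens = output_ids[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)
```

A call such as `generate("Ibu kota Indonesia adalah")` would download the weights on first use and continue the Indonesian sentence. Because this is a base model rather than a Sailor-Chat variant, it completes text rather than following instructions, so prompts work best when phrased as the beginning of the desired output.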