Overview
Sailor-7B-Chat is a 7.7-billion-parameter instruction-tuned language model from the Sailor suite, developed by sail. Built on the Qwen 1.5 architecture, it is optimized for South-East Asian (SEA) languages, including Indonesian, Thai, Vietnamese, Malay, and Lao, while maintaining strong performance in English and Chinese.
Key Capabilities & Training
- Multilingual Proficiency: Designed to understand and generate text across diverse SEA linguistic landscapes, alongside English and Chinese.
- Instruction-Tuned: Fine-tuned with publicly available instruction datasets such as aya_collection, aya_dataset, and OpenOrca.
- Extensive Pre-training: Continually pre-trained on 200 to 400 billion tokens from corpora such as SlimPajama, SkyPile, CC100, and MADLAD-400, with aggressive data deduplication and cleaning.
- Performance: Benchmarks show strong results on question answering and commonsense reasoning tasks in SEA languages.
- Context Length: Supports a context length of 32768 tokens.
Use Cases
Sailor-7B-Chat is particularly well-suited for applications requiring strong language understanding and generation in South-East Asian languages. Its instruction tuning makes it effective for conversational AI, question-answering systems, and other tasks where precise instruction following matters, especially within the SEA region.
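As a minimal inference sketch with Hugging Face Transformers, assuming the repository id `sail/Sailor-7B-Chat` and that the tokenizer ships a chat template (verify both against the model card). The small helper for trimming long conversations to the 32768-token window is a hypothetical utility, not part of the model's API:

```python
# Sketch: chat inference with Sailor-7B-Chat via Hugging Face transformers.
# Assumptions: the repo id "sail/Sailor-7B-Chat" and a tokenizer-provided
# chat template usable through apply_chat_template.

MAX_CONTEXT = 32768  # context length stated in the model description


def truncate_to_context(token_ids, reserve=512, max_context=MAX_CONTEXT):
    """Hypothetical helper: keep the most recent tokens so the prompt plus
    up to `reserve` generated tokens fit inside the context window."""
    budget = max_context - reserve
    return token_ids[-budget:] if len(token_ids) > budget else token_ids


if __name__ == "__main__":
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "sail/Sailor-7B-Chat"  # assumed Hugging Face repo id
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

    messages = [
        {"role": "user", "content": "Terjemahkan ke bahasa Inggris: Selamat pagi!"},
    ]
    # Let the tokenizer's own chat template format the conversation.
    input_ids = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)

    output = model.generate(input_ids, max_new_tokens=512)
    print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```

Using the tokenizer's built-in chat template rather than hand-formatting the prompt keeps the sketch robust to whatever role markup the released tokenizer actually defines.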