sail/Sailor-4B-Chat

  • Parameters: 4B
  • Precision: BF16
  • Context length: 32,768 tokens
  • Released: Mar 2, 2024
  • License: apache-2.0

Overview

Sailor-4B-Chat is part of the Sailor suite of open language models, developed by the sail team and optimized specifically for South-East Asian (SEA) languages. Built on the Qwen 1.5 architecture, this 4-billion-parameter model is instruction-tuned on publicly available datasets and targets languages such as Indonesian, Thai, Vietnamese, Malay, and Lao. It retains strong performance in English and Chinese, making it versatile for multilingual applications.
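As a minimal sketch, the checkpoint can be loaded with the Hugging Face transformers library. The model ID comes from this card; the dtype and device arguments below are illustrative assumptions, not requirements:

```python
def load_sailor(model_id: str = "sail/Sailor-4B-Chat"):
    """Return a (tokenizer, model) pair for the given checkpoint."""
    # Imports are deferred so the sketch reads without transformers installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        torch_dtype="auto",   # the card lists BF16 weights
        device_map="auto",    # place layers on available GPUs/CPU
    )
    return tokenizer, model
```

Calling `load_sailor()` downloads roughly 8 GB of BF16 weights on first use, so a GPU with sufficient memory (or CPU offload via `device_map="auto"`) is advisable.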

Key Capabilities

  • Multilingual Proficiency: Designed to understand and generate text in Indonesian, Thai, Vietnamese, Malay, Lao, English, and Chinese.
  • Instruction Following: Fine-tuned with datasets like aya_collection, aya_dataset, and OpenOrca for robust instruction-following capabilities.
  • Extended Context: Supports a 32,768-token context window, suitable for processing longer inputs and generating more coherent long-form responses.
  • Benchmarked Performance: Demonstrates proficiency in tasks such as question answering and commonsense reasoning in SEA languages.

Training Details

The Sailor models are produced by continual pre-training from Qwen 1.5 on a high-quality, deduplicated corpus that includes SlimPajama, SkyPile, CC100, and MADLAD-400. The 4B model was trained on 200 billion tokens, with systematic experiments used to tune the language-mixture weights, improving SEA-language performance without compromising English and Chinese proficiency.

Good For

  • Applications requiring strong language understanding and generation in South-East Asian languages.
  • Developing chatbots or assistants for multilingual user bases.
  • Research and development in low-resource language NLP, particularly for SEA regions.