sail/Sailor-1.8B

Parameters: 1.8B · Precision: BF16 · Context length: 32,768 tokens
Released: Feb 29, 2024 · License: apache-2.0 · Available on Hugging Face

Overview

Sailor-1.8B: South-East Asian Language Model

Sailor-1.8B is part of the Sailor suite of open language models, specifically designed to cater to the linguistic diversity of South-East Asia (SEA). Developed by sail and built on the robust Qwen 1.5 architecture, this 1.8 billion parameter model focuses on languages such as Indonesian, Thai, Vietnamese, Malay, and Lao.

Key Capabilities & Features

  • Multilingual Proficiency: Optimized for SEA languages, with strong performance in Indonesian, Thai, Vietnamese, Malay, and Lao, while retaining English and Chinese capabilities.
  • Strong Foundation: Continually pre-trained from Qwen 1.5 models, building on their existing multilingual strengths.
  • Data Curation: Benefits from aggressive data deduplication and careful cleaning of a high-quality corpus, including datasets like SlimPajama, SkyPile, CC100, and MADLAD-400.
  • Instruction-Tuned Variants: Base models are further fine-tuned with open-source datasets to create instruction-tuned versions (Sailor-Chat).
  • Benchmarked Performance: Demonstrates proficiency in tasks such as question answering and commonsense reasoning in SEA languages.
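The model can be used with the standard Hugging Face transformers API. Below is a minimal sketch, assuming the `transformers` and `torch` packages are installed; the generation settings and the Indonesian example prompt are illustrative choices, not official recommendations.

```python
# Minimal sketch of text generation with Sailor-1.8B via Hugging Face transformers.
MODEL_ID = "sail/Sailor-1.8B"

# Illustrative sampling settings (assumptions, not official defaults).
GEN_KWARGS = {
    "max_new_tokens": 64,
    "do_sample": True,
    "temperature": 0.7,
    "top_p": 0.9,
}

def build_prompt(text: str) -> str:
    """Sailor-1.8B is a base model (not instruction-tuned), so plain-text
    continuation prompts work best; just strip stray whitespace."""
    return text.strip()

if __name__ == "__main__":
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID, torch_dtype=torch.bfloat16
    )

    # Indonesian continuation prompt as an example SEA-language input.
    prompt = build_prompt("Ibu kota Indonesia adalah")
    inputs = tokenizer(prompt, return_tensors="pt")
    output = model.generate(**inputs, **GEN_KWARGS)
    print(tokenizer.decode(output[0], skip_special_tokens=True))
```

For conversational use, the instruction-tuned Sailor-Chat variants are the better fit; the base model shown here simply continues the given text.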

Training Details

Sailor models are trained on 200B to 400B tokens; this 1.8B-parameter model was continually pre-trained on 200 billion tokens. This systematic approach, including careful weighting of the different languages during training, improves performance in SEA languages without significantly compromising English and Chinese proficiency.

Use Cases

Sailor-1.8B is ideal for applications requiring strong language understanding and generation in South-East Asian contexts, including:

  • Developing chatbots or virtual assistants for SEA markets.
  • Content generation and summarization in Indonesian, Thai, Vietnamese, Malay, and Lao.
  • Research and development in multilingual NLP for the SEA region.
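Because the base model completes text rather than follows instructions, tasks such as question answering are typically framed as few-shot continuation prompts. The helper below is a hypothetical illustration of that pattern (the function name and format are assumptions, not part of the Sailor release):

```python
# Hypothetical helper: build a few-shot QA prompt for a base model
# such as Sailor-1.8B. Worked examples are concatenated so the model
# learns the format and continues after the final "A:".
def few_shot_qa_prompt(examples, question):
    """`examples` is a list of (question, answer) pairs."""
    lines = []
    for q, a in examples:
        lines.append(f"Q: {q}")
        lines.append(f"A: {a}")
    lines.append(f"Q: {question}")
    lines.append("A:")  # the model is expected to complete this line
    return "\n".join(lines)
```

The resulting string would be passed to the tokenizer and `generate()` as in any transformers workflow; a stop sequence such as `"\nQ:"` keeps the model from inventing further questions.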