sail/Sailor-1.8B-Chat

1.8B parameters · BF16 · 32,768-token context · License: apache-2.0
Overview

Sailor-1.8B-Chat is a 1.8 billion parameter instruction-tuned language model developed by sail, designed with a strong focus on South-East Asian (SEA) languages. Built upon the Qwen 1.5 architecture, this model is part of a suite of Sailor models ranging from 0.5B to 14B parameters. It has been continually pre-trained and fine-tuned using a combination of public datasets, with careful data curation and deduplication to enhance performance in SEA languages.

Key Capabilities

  • Multilingual Proficiency: Optimized for Indonesian, Thai, Vietnamese, Malay, and Lao, while retaining strong performance in English and Chinese.
  • Instruction Following: Fine-tuned with open-source instruction datasets like aya_collection, aya_dataset, and OpenOrca for chat-based applications.
  • Reasoning and QA: Benchmarking results indicate proficiency in tasks such as question answering and commonsense reasoning.
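For chat-based use, a conversation must be serialized into the model's expected prompt format. The sketch below assumes Sailor-1.8B-Chat inherits the ChatML template from its Qwen 1.5 base; in practice, prefer `tokenizer.apply_chat_template` from the transformers library, which reads the template shipped with the model.

```python
# Sketch: hand-building a single-turn ChatML prompt, assuming the model
# follows the ChatML format inherited from Qwen 1.5. The helper name
# build_chatml_prompt is illustrative, not part of any library.

def build_chatml_prompt(system: str, user: str) -> str:
    """Format a single-turn conversation in ChatML."""
    return (
        f"<|im_start|>system\n{system}<|im_end|>\n"
        f"<|im_start|>user\n{user}<|im_end|>\n"
        f"<|im_start|>assistant\n"  # generation continues from here
    )

prompt = build_chatml_prompt(
    "You are a helpful assistant.",
    "Apa ibu kota Indonesia?",  # Indonesian: "What is the capital of Indonesia?"
)
print(prompt)
```

The trailing `<|im_start|>assistant\n` leaves the prompt open so the model's completion becomes the assistant turn.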

Training Details

The model underwent continual pre-training from Qwen 1.5, leveraging large public corpora including SlimPajama, SkyPile, CC100, and MADLAD-400. Training covered 200 to 400 billion tokens, with systematic experiments to balance language weights, ensuring robust performance across the target SEA languages without significantly compromising English and Chinese capabilities. A technical report is available for more details: arxiv.org/pdf/2404.03608.pdf.
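The deduplication step mentioned above can be illustrated with a generic near-duplicate filter; the actual Sailor pipeline is described in the technical report, and the shingle/Jaccard method here is only a common, simplified example of the idea.

```python
# Illustrative sketch of near-duplicate filtering over a corpus. This is a
# generic character-shingle / Jaccard-similarity approach, not Sailor's
# actual pipeline (see the technical report for that).

def shingles(text: str, n: int = 3) -> set:
    """Character n-gram shingles of a whitespace-normalized string."""
    t = " ".join(text.lower().split())
    return {t[i:i + n] for i in range(len(t) - n + 1)}

def jaccard(a: set, b: set) -> float:
    """Jaccard similarity of two shingle sets."""
    return len(a & b) / len(a | b) if a | b else 1.0

def dedup(docs: list, threshold: float = 0.8) -> list:
    """Greedily keep documents whose similarity to every kept
    document stays below the threshold."""
    kept, kept_shingles = [], []
    for d in docs:
        s = shingles(d)
        if all(jaccard(s, ks) < threshold for ks in kept_shingles):
            kept.append(d)
            kept_shingles.append(s)
    return kept

docs = [
    "Sailor models target South-East Asian languages.",
    "Sailor models target South-East Asian languages!",  # near-duplicate
    "Qwen 1.5 is the base architecture.",
]
print(dedup(docs))  # the near-duplicate second entry is dropped
```

Production pipelines typically replace the exact pairwise comparison with MinHash or similar sketching so the filter scales to billions of documents.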

Good For

  • Applications requiring strong language understanding and generation in South-East Asian languages.
  • Chatbots and conversational AI systems targeting users in Indonesia, Thailand, Vietnam, Malaysia, and Laos.
  • Research and development in multilingual NLP, particularly for low-resource SEA languages.