sail/Sailor-14B

Parameters: 14.2B · Tensor type: FP8 · Context length: 32768 · License: apache-2.0

Sailor-14B: South-East Asian Language Model

Sailor-14B is a 14.2-billion-parameter model from the Sailor suite, developed by sail, focused on linguistic understanding and generation for South-East Asian (SEA) languages. It is built on the Qwen1.5 architecture and has been continually pre-trained to improve performance in Indonesian, Thai, Vietnamese, Malay, and Lao, while retaining capability in English and Chinese.

Key Capabilities & Features

  • Multilingual Proficiency: Optimized for SEA languages, demonstrating strong performance in question answering and commonsense reasoning tasks.
  • Robust Training: Continually pre-trained on 200 billion tokens drawn from a high-quality, deduplicated corpus that includes SlimPajama, SkyPile, CC100, and MADLAD-400.
  • Instruction-Tuned Variants: Base models are further fine-tuned with open-source datasets to create instruction-tuned versions (Sailor-Chat).
  • Context Length: Supports a context window of 32768 tokens.
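As a Qwen1.5-based causal language model, Sailor-14B can be loaded with the standard Hugging Face `transformers` API. The sketch below is illustrative, not from the model card: the model ID matches this card, but the device, dtype, and generation settings are assumptions, and running it requires downloading the full 14B-parameter weights.

```python
# Minimal sketch of text generation with Sailor-14B via transformers.
# Assumptions (not from the card): device_map/dtype settings and the
# sample Indonesian prompt; adjust max_new_tokens to your use case.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "sail/Sailor-14B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",   # place weights on available GPU(s)/CPU
    torch_dtype="auto",  # use the checkpoint's native precision
)

# Indonesian prompt: "The capital city of Indonesia is"
prompt = "Ibu kota Indonesia adalah"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

Note that this base model is a plain completion model; for dialogue-style usage, the instruction-tuned Sailor-Chat variants with a chat template are the more appropriate choice.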

Use Cases & Differentiators

Sailor-14B is particularly suited for applications requiring deep linguistic understanding and generation in the specified South-East Asian languages. Its specialized training makes it a strong candidate for tasks where general-purpose models might underperform in these regional contexts. The model maintains strong performance in English and Chinese, offering a versatile solution for multilingual environments. For more technical details, refer to the technical report.