sail/Sailor-0.5B-Chat

Parameters: 0.6B
Precision: BF16
Context length: 32,768 tokens
Released: Mar 2, 2024
License: apache-2.0
Hugging Face: https://huggingface.co/sail/Sailor-0.5B-Chat
Overview

Sailor-0.5B-Chat: South-East Asian Language Model

Sailor-0.5B-Chat is an instruction-tuned model from the Sailor suite, developed by Sea AI Lab (sail), with roughly 0.6 billion parameters. It is built on the Qwen 1.5 architecture and optimized for South-East Asian (SEA) languages, including Indonesian, Thai, Vietnamese, Malay, and Lao, while retaining strong performance in English and Chinese.
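
A minimal inference sketch with the Hugging Face transformers library is shown below. The chat roles, sampling settings, and dtype choice are assumptions (the model inherits Qwen 1.5's tokenizer and chat template); consult the upstream model card for the recommended usage.

    # Minimal sketch: load Sailor-0.5B-Chat and generate one reply.
    # Assumes the tokenizer ships a Qwen-1.5-style chat template; settings are illustrative.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "sail/Sailor-0.5B-Chat"
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id, torch_dtype=torch.bfloat16, device_map="auto"
    )

    # Indonesian prompt: "What is the capital of Indonesia?"
    messages = [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Apa ibu kota Indonesia?"},
    ]
    input_ids = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)

    output = model.generate(
        input_ids, max_new_tokens=128, do_sample=True, temperature=0.7, top_p=0.9
    )
    print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))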

Key Capabilities & Training

  • Multilingual Proficiency: Designed to understand and generate text across the diverse linguistic landscape of the SEA region.
  • Instruction-Tuned: Fine-tuned with publicly available datasets like aya_collection, aya_dataset, and OpenOrca to enhance its conversational abilities.
  • Robust Pre-training: Continually pre-trained from Qwen 1.5 base models on a high-quality, carefully deduplicated corpus that includes SlimPajama, SkyPile, CC100, and MADLAD-400.
  • Optimized for SEA: The pre-training data mixture was weighted across the different SEA languages through systematic experiments; the 0.5B model was trained on 400 billion tokens.
  • Benchmarked Performance: Demonstrates proficiency in tasks such as question answering and commonsense reasoning in SEA languages.

Use Cases

  • Applications requiring strong language understanding and generation in Indonesian, Thai, Vietnamese, Malay, and Lao.
  • Chatbots and conversational AI systems targeting users in South-East Asia (a minimal chat-loop sketch follows this list).
  • Research and development in low-resource language NLP for the SEA region.
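
As a sketch of the chatbot use case above, a hypothetical multi-turn loop might keep the running message history and re-apply the chat template on each turn. The role names, system prompt, and sampling settings here are assumptions, not an official recipe.

    # Hypothetical multi-turn chat loop around Sailor-0.5B-Chat.
    # The full history is re-encoded with the chat template on every turn.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "sail/Sailor-0.5B-Chat"
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id, torch_dtype=torch.bfloat16, device_map="auto"
    )

    history = [{"role": "system", "content": "You are a helpful assistant."}]
    while True:
        user_text = input("user> ").strip()
        if not user_text:  # an empty line ends the session
            break
        history.append({"role": "user", "content": user_text})
        input_ids = tokenizer.apply_chat_template(
            history, add_generation_prompt=True, return_tensors="pt"
        ).to(model.device)
        output = model.generate(
            input_ids, max_new_tokens=256, do_sample=True, temperature=0.7, top_p=0.9
        )
        reply = tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True)
        history.append({"role": "assistant", "content": reply})
        print("assistant>", reply)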