aisingapore/Llama-SEA-LION-v3-70B

Text Generation · Concurrency Cost: 4 · Model Size: 70B · Quant: FP8 · Ctx Length: 32k · Published: Dec 11, 2024 · License: llama3.1 · Architecture: Transformer

Llama-SEA-LION-v3-70B is a 70 billion parameter multilingual large language model developed by AI Singapore, built upon the Llama 3.1 architecture. It has undergone continued pre-training on approximately 200 billion tokens across 11 Southeast Asian languages: Burmese, Chinese, English, Filipino, Indonesian, Khmer, Lao, Malay, Tamil, Thai, and Vietnamese. The model is optimized for general language capabilities and constraint-following behavior within the Southeast Asian linguistic context, making it well suited for applications requiring strong performance in these languages.


Llama-SEA-LION-v3-70B: Multilingual LLM for Southeast Asia

Llama-SEA-LION-v3-70B is a 70 billion parameter large language model developed by AI Singapore, building on Llama 3.1-70B-Instruct. It has undergone continued pre-training on approximately 200 billion tokens across 11 Southeast Asian (SEA) languages: Burmese, Chinese, English, Filipino, Indonesian, Khmer, Lao, Malay, Tamil, Thai, and Vietnamese. This continued pre-training aims to enhance its understanding and generation capabilities specifically for the SEA region.
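Because the model builds on Llama 3.1-70B-Instruct, prompts follow the Llama 3.1 chat format. The sketch below assembles that format by hand purely for illustration; in practice the tokenizer's `apply_chat_template` method handles this for you.

```python
def format_llama31_chat(messages):
    """Assemble a Llama 3.1-style chat prompt from role/content messages.

    Mirrors the Llama 3.1 special-token layout; real code should call
    tokenizer.apply_chat_template rather than hand-building the string.
    """
    parts = ["<|begin_of_text|>"]
    for msg in messages:
        parts.append(
            f"<|start_header_id|>{msg['role']}<|end_header_id|>\n\n"
            f"{msg['content']}<|eot_id|>"
        )
    # Cue the model to produce the assistant turn next.
    parts.append("<|start_header_id|>assistant<|end_header_id|>\n\n")
    return "".join(parts)

prompt = format_llama31_chat(
    [{"role": "user", "content": "Terjemahkan ke Bahasa Inggris: Selamat pagi!"}]
)
```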

Key Capabilities

  • Multilingual Proficiency: Strong performance across 11 SEA languages due to targeted pre-training.
  • General Language Tasks: Evaluated on tasks such as Question Answering, Sentiment Analysis, Toxicity Detection, Translation, Abstractive Summarization, Causal Reasoning, and Natural Language Inference using the SEA-HELM benchmark.
  • Constraint Following: Assessed for its ability to adhere to specific instructions and constraints in both English and SEA languages via SEA-IFEval, a localized version of IFEval.
  • Extensive Training Data: Utilizes a diverse dataset including SEA-LION Pile v1 and v2, Dolma, Fineweb-Edu, StackV2 (for code), and other language-specific corpora.

Good For

  • Applications requiring robust language understanding and generation in multiple Southeast Asian languages.
  • Developers building solutions for multilingual contexts within the SEA region.
  • Research and development focused on improving LLM performance for less-resourced languages.