Llama-SEA-LION-v2-8B: Southeast Asian Multilingual LLM

Llama-SEA-LION-v2-8B is an 8-billion-parameter large language model developed by AI Singapore for the Southeast Asian (SEA) region. It is built on Meta-Llama-3-8B-Instruct and has undergone continued pre-training on approximately 48 billion tokens across five languages used in the region: English, Indonesian, Tamil, Thai, and Vietnamese.

Key Capabilities & Features

Multilingual Proficiency: Specialized in English, Indonesian, Tamil, Thai, and Vietnamese through continued pre-training on text in those languages.
Llama 3 Architecture: Uses the Llama 3 decoder-only transformer architecture.
Extensive Pre-training: Continued pre-training on approximately 48 billion tokens from a mixed dataset that includes language-specific corpora such as SEA-LION Pile and WangChanBERTa.
General Language Tasks: Evaluated on the BHASA benchmark for tasks such as Question Answering, Sentiment Analysis, Toxicity Detection, Translation, Summarization, Causal Reasoning, and Natural Language Inference.
Community License: Released under the Llama3 Community License, facilitating broader use and development.

When to Use This Model

Southeast Asian Applications: Ideal for developers building applications that require strong language understanding and generation in English, Indonesian, Tamil, Thai, or Vietnamese.
Multilingual Chatbots & Assistants: Suitable for creating conversational AI systems tailored for the SEA market.
Research & Development: Provides a strong base for further fine-tuning or research into multilingual LLMs for the region.

Note: This model has not been aligned for safety, and users are advised to perform their own safety fine-tuning.
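
The following is a minimal sketch of how the model might be loaded for text generation with the Hugging Face transformers library. The repository id in MODEL_ID is an assumption inferred from the model name and should be checked against the full model card. The helper names is_supported and generate are hypothetical, and device_map="auto" assumes the accelerate package is installed. The sketch adds no safety filtering, so the note above still applies.

```python
# Assumption: repository id inferred from the model name; verify it before use.
MODEL_ID = "aisingapore/Llama-SEA-LION-v2-8B"

# Languages named in the model description, keyed by ISO 639-1 code.
SUPPORTED_LANGUAGES = {
    "en": "English",
    "id": "Indonesian",
    "ta": "Tamil",
    "th": "Thai",
    "vi": "Vietnamese",
}


def is_supported(lang_code: str) -> bool:
    """Return True if the language is one the model was further pre-trained on."""
    return lang_code.lower() in SUPPORTED_LANGUAGES


def generate(prompt: str, lang_code: str, max_new_tokens: int = 64) -> str:
    """Generate a continuation of the prompt (hypothetical helper)."""
    if not is_supported(lang_code):
        raise ValueError(f"Language {lang_code!r} is not listed for this model")

    # Imported lazily so the language check above works without transformers installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID,
        torch_dtype="auto",   # use the dtype stored in the checkpoint
        device_map="auto",    # assumes accelerate is installed
    )

    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

A call such as generate("Ibu kota Indonesia adalah", "id") would download the weights on first use, so an 8B model needs a correspondingly sized GPU or a quantized variant. The language check runs first, so unsupported codes fail fast without loading anything.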
