Name: GoToCompany/llama3-8b-cpt-sahabatai-v1-base API
Brand: Featherless.ai
Price: 10.00 USD
Availability: InStock
Author: GoToCompany

Sahabat-AI v1: Llama 3 for Indonesian Languages

GoToCompany/llama3-8b-cpt-sahabatai-v1-base is an 8 billion parameter Llama 3 model, co-initiated by GoTo Group and Indosat Ooredoo Hutchison, and developed by PT GoTo Gojek Tokopedia Tbk and AI Singapore. It is designed for Indonesian and for regional languages of Indonesia, including Javanese and Sundanese, and builds on the AI Singapore-Llama-3-8B-Sea-Lion v2.1-Instruct model.

Key Capabilities

Multilingual Proficiency: Continued pre-training on 50 billion tokens, of which roughly 30 billion come from the Indonesian, Javanese, and Sundanese sources listed under Training Details, targets strong performance in these languages.
General Language Tasks: Evaluated on the SEA HELM (BHASA) benchmark for tasks like Question Answering, Sentiment Analysis, Toxicity Detection, Translation, Summarization, Causal Reasoning, and Natural Language Inference.
Context Length: Features an 8192-token context window, utilizing the default Llama-3-8B tokenizer.
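
The context window covers the prompt and the generated continuation together, so a caller has to split the 8192 tokens between the two. The following is a minimal sketch of that budgeting arithmetic. The function name `max_new_tokens_for` is hypothetical, and the prompt token count is assumed to have been measured beforehand with the model's tokenizer; nothing here is part of the Featherless or Hugging Face APIs.

```python
CONTEXT_WINDOW = 8192  # context length stated for this model


def max_new_tokens_for(prompt_tokens: int, requested: int,
                       window: int = CONTEXT_WINDOW) -> int:
    """Clamp a requested generation length so prompt + output fit the window.

    prompt_tokens: number of tokens in the prompt (assumed already counted).
    requested: number of new tokens the caller would like to generate.
    """
    if prompt_tokens < 0 or requested < 0:
        raise ValueError("token counts must be non-negative")
    if prompt_tokens >= window:
        raise ValueError("prompt alone fills or exceeds the context window")
    # Whatever the prompt does not use is available for generation.
    return min(requested, window - prompt_tokens)
```

With an 8000-token prompt, for example, a request for 512 new tokens would be clamped to the 192 tokens that remain in the window.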

Training Details

The model underwent continued pre-training on a diverse dataset that includes 27.5 billion tokens from SEA-LION Pile - Indonesian, 1.5 billion Javanese tokens, and 0.75 billion Sundanese tokens, alongside other sources such as Dolma Refined Web and StarCoder. Training ran on 32 Nvidia H100 80GB GPUs for 5 days using MosaicML Composer.
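
To put these figures in proportion, the short calculation below derives each named source's share of the training mix and the approximate compute spent. It is a sketch that assumes the 50 billion token total quoted under Key Capabilities is exact and that the 5 days were 5 full 24-hour days on all 32 GPUs; both are rounded figures in the source, so the results are estimates.

```python
TOTAL_TOKENS_B = 50.0  # total continued pre-training tokens, in billions (assumed exact)

# Token counts (billions) for the sources the listing names explicitly.
named_sources_b = {
    "SEA-LION Pile - Indonesian": 27.5,
    "Javanese": 1.5,
    "Sundanese": 0.75,
}

# Share of the full mix contributed by each named source.
shares = {name: tokens / TOTAL_TOKENS_B for name, tokens in named_sources_b.items()}

# Tokens left over for the remaining sources (Dolma Refined Web, StarCoder, others).
remainder_b = TOTAL_TOKENS_B - sum(named_sources_b.values())

# Approximate compute: 32 GPUs running for 5 days of 24 hours each.
gpu_hours = 32 * 5 * 24

for name, share in shares.items():
    print(f"{name}: {share:.1%}")
print(f"Other sources: {remainder_b} B tokens")
print(f"Approximate compute: {gpu_hours} GPU-hours")
```

Under these assumptions the Indonesian pile accounts for 55% of the mix, Javanese for 3%, and Sundanese for 1.5%, leaving 20.25 billion tokens for the other sources, and the run comes to about 3840 H100 GPU-hours.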

Good For

This model is ideal for developers and researchers focusing on applications requiring robust language understanding and generation in Indonesian, Javanese, and Sundanese. Its specialized training makes it a strong candidate for tasks such as content creation, customer support, and data analysis within these linguistic contexts.
