GoToCompany/llama3-8b-cpt-sahabatai-v1-base

Text Generation | Concurrency Cost: 1 | Model Size: 8B | Quant: FP8 | Ctx Length: 8k | Tool Calling: Supported | Published: Nov 6, 2024 | License: llama3 | Architecture: Transformer

GoToCompany/llama3-8b-cpt-sahabatai-v1-base is an 8 billion parameter, decoder-only Llama 3 language model developed by PT GoTo Gojek Tokopedia Tbk and AI Singapore, with a context length of 8192 tokens. It has undergone continued pre-training on approximately 50 billion tokens to optimize it for the Indonesian language and its various dialects, including Javanese and Sundanese. It targets general language tasks in these languages, making it suitable for applications that need Indonesian, Javanese, and Sundanese understanding.

Sahabat-AI v1: Llama 3 for Indonesian Languages

GoToCompany/llama3-8b-cpt-sahabatai-v1-base is an 8 billion parameter Llama 3 model, co-initiated by GoTo Group and Indosat Ooredoo Hutchison, and developed by PT GoTo Gojek Tokopedia Tbk and AI Singapore. It is designed for the Indonesian language and its dialects, including Javanese and Sundanese, and builds on the AI Singapore-Llama-3-8B-Sea-Lion v2.1-Instruct model.

Key Capabilities

  • Multilingual Proficiency: Continued pre-training on approximately 50 billion tokens, a significant portion of which is Indonesian, Javanese, and Sundanese data, to strengthen performance in these languages.
  • General Language Tasks: Evaluated on the SEA HELM (BHASA) benchmark for tasks like Question Answering, Sentiment Analysis, Toxicity Detection, Translation, Summarization, Causal Reasoning, and Natural Language Inference.
  • Context Length: Features an 8192-token context window, utilizing the default Llama-3-8B tokenizer.
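
The 8192-token window covers the prompt and the generated tokens together, so a long prompt may need trimming before generation. A minimal sketch of that budgeting, assuming the token IDs have already been produced by the tokenizer; the helper name `fit_to_context` and the choice to keep the most recent tokens are illustrative assumptions, not part of the model's tooling:

```python
CONTEXT_LENGTH = 8192  # context window stated for this model


def fit_to_context(token_ids, max_new_tokens, context_length=CONTEXT_LENGTH):
    """Trim a prompt so that prompt + generated tokens fit in the window.

    Hypothetical helper: keeps the most recent tokens and drops the oldest.
    """
    if not 0 < max_new_tokens < context_length:
        raise ValueError("max_new_tokens must be between 1 and context_length - 1")
    budget = context_length - max_new_tokens  # tokens left for the prompt
    return list(token_ids[-budget:])


# Example with stand-in integer IDs instead of real tokenizer output.
prompt_ids = list(range(10_000))
trimmed = fit_to_context(prompt_ids, max_new_tokens=256)
print(len(trimmed))  # 7936 = 8192 - 256
```

Dropping the oldest tokens is only one possible policy; summarizing or chunking the input may suit some applications better.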

Training Details

The model underwent continued pre-training on a diverse dataset that includes 27.5 billion tokens from the SEA-LION Pile - Indonesian, 1.5 billion Javanese tokens, and 0.75 billion Sundanese tokens, alongside other sources such as Dolma Refined Web and Star Coder. Training ran on 32 Nvidia H100 80GB GPUs for 5 days using MosaicML Composer.
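
These figures allow a quick back-of-the-envelope check of the data mix and compute budget. The sketch below uses only the numbers quoted in this section; the 50 billion total is approximate, so the derived shares are rough estimates, and the GPU-hour figure assumes all 32 GPUs ran for the full 5 days:

```python
TOTAL_TOKENS_B = 50.0  # approximate total, in billions of tokens

# Token counts quoted in this section, in billions.
regional_tokens_b = {
    "SEA-LION Pile - Indonesian": 27.5,
    "Javanese": 1.5,
    "Sundanese": 0.75,
}

regional_total_b = sum(regional_tokens_b.values())
regional_share = regional_total_b / TOTAL_TOKENS_B
other_tokens_b = TOTAL_TOKENS_B - regional_total_b  # remaining sources combined

# Upper bound on compute: 32 GPUs x 5 days x 24 hours.
gpu_hours = 32 * 5 * 24

print(f"Indonesian + Javanese + Sundanese: {regional_total_b} B tokens")
print(f"Share of the ~50 B total: {regional_share:.1%}")
print(f"Other sources: about {other_tokens_b} B tokens")
print(f"Compute: up to {gpu_hours} H100 GPU-hours")
```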

Good For

This model suits developers and researchers building applications that need language understanding and generation in Indonesian, Javanese, and Sundanese. Its specialized training makes it a candidate for tasks such as content creation, customer support, and data analysis in these languages.
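
Because this is a base model rather than an instruction-tuned one, tasks are usually posed as text to be continued, for example as a few-shot prompt. A minimal sketch, assuming the Hugging Face `transformers` and `torch` packages (plus `accelerate` for `device_map="auto"`) are installed and the weights are accessible under the model ID from this page; the helper names, the Indonesian sentiment examples, and the generation settings are illustrative assumptions:

```python
MODEL_ID = "GoToCompany/llama3-8b-cpt-sahabatai-v1-base"


def build_few_shot_prompt(examples, query):
    """Format (review, label) pairs plus a new review as a completion prompt."""
    blocks = [f"Ulasan: {text}\nSentimen: {label}" for text, label in examples]
    blocks.append(f"Ulasan: {query}\nSentimen:")
    return "\n\n".join(blocks)


def generate(prompt, max_new_tokens=5):
    """Load the model and greedily continue the prompt (downloads the weights)."""
    # Imported lazily so the prompt helper works without these packages.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto"
    )
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
    # Keep only the tokens generated after the prompt.
    new_tokens = output[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)


# Hypothetical labeled examples for a sentiment task.
examples = [
    ("Makanannya enak sekali.", "positif"),
    ("Pelayanannya sangat lambat.", "negatif"),
]
prompt = build_few_shot_prompt(examples, "Harganya terlalu mahal.")
print(prompt)
# print(generate(prompt))  # uncomment on a machine with enough GPU memory
```

The sketch loads bfloat16 weights locally, which differs from the FP8 quantization listed for the hosted version, so outputs may not match exactly.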