Sahabat-AI/llama3-8b-cpt-sahabatai-v1-base
Sahabat-AI/llama3-8b-cpt-sahabatai-v1-base is an 8 billion parameter, Llama 3-based, decoder-only language model developed by PT GoTo Gojek Tokopedia Tbk and AI Singapore. It has undergone continued pre-training on approximately 50 billion tokens and is optimized for Indonesian and for regional languages of Indonesia, including Javanese and Sundanese. With an 8192-token context length, the model targets general language capabilities across these languages, making it suitable for applications that require strong Indonesian linguistic understanding.
Sahabat-AI/llama3-8b-cpt-sahabatai-v1-base Overview
This model is an 8 billion parameter Llama 3-based language model, part of the Sahabat-AI ecosystem, co-initiated by Indonesian tech and telecommunication companies GoTo Group and Indosat Ooredoo Hutchison. It was developed by PT GoTo Gojek Tokopedia Tbk and AI Singapore, building upon the AI Singapore-Llama-3-8B-Sea-Lion v2.1-Instruct model.
Key Capabilities & Training
The model has undergone continued pre-training on approximately 50 billion tokens, with a significant focus on Indonesian (55%), Javanese (3%), and Sundanese (1.5%) data, alongside English and other general datasets. It utilizes the default Llama-3-8B tokenizer and supports an 8192-token context length. Training was conducted on 32 Nvidia H100 80GB GPUs for 5 days using MosaicML Composer.
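The percentages and hardware figures above translate into rough absolute numbers. The sketch below is only back-of-the-envelope arithmetic on the figures quoted in this section; it assumes the 50 billion token total is exact, that the remainder of the mix is "English and other" data (the card gives no finer split), and that all 32 GPUs ran for five full 24-hour days.

```python
# Figures quoted in the model card.
TOTAL_TOKENS = 50e9  # "approximately 50 billion tokens", treated as exact here
LANGUAGE_SHARE = {"Indonesian": 0.55, "Javanese": 0.03, "Sundanese": 0.015}


def token_budget(total, shares):
    """Convert fractional shares into approximate token counts."""
    budget = {lang: total * share for lang, share in shares.items()}
    # Assumption: everything not listed is English and other general data.
    budget["English and other"] = total - sum(budget.values())
    return budget


def gpu_hours(num_gpus, days):
    """Upper-bound GPU-hours, assuming every GPU is busy 24 hours a day."""
    return num_gpus * days * 24


budget = token_budget(TOTAL_TOKENS, LANGUAGE_SHARE)
for lang, tokens in budget.items():
    print(f"{lang}: ~{tokens / 1e9:.2f}B tokens")
print(f"Compute: ~{gpu_hours(32, 5)} H100 GPU-hours")
```

Under these assumptions, Javanese and Sundanese together account for only about 2.25 billion tokens, which helps put the benchmark scores for those languages in context.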
Benchmark Performance
The model was evaluated on the SEA HELM (BHASA) benchmark, which covers tasks such as QA, Sentiment Analysis, Toxicity Detection, Translation, Summarization, Causal Reasoning, and NLI across Indonesian, Javanese, and Sundanese. The sahabatai-v1-8B model achieved an overall score of 59.437, with language-level scores of 65.048 in Javanese and 59.809 in Sundanese. Its English performance on the HuggingFace LLM Leaderboard tasks (average 13.92) is lower than that of some English-centric models; its strength lies in its specialized capabilities for Indonesian and regional Indonesian languages.
Ideal Use Cases
- Applications requiring robust understanding and generation in Indonesian, Javanese, and Sundanese.
- AI-based services and applications tailored for the Indonesian market.
- Research and development focusing on low-resource Southeast Asian languages.
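
For the use cases above, a minimal loading sketch with the Hugging Face `transformers` library might look like the following. It assumes the repository id written in this card resolves on the Hugging Face Hub, that `torch`, `transformers`, and `accelerate` are installed, and that enough GPU memory is available for an 8 billion parameter model in bfloat16. The helper names `max_prompt_tokens` and `complete` are hypothetical and exist only for illustration.

```python
MODEL_ID = "Sahabat-AI/llama3-8b-cpt-sahabatai-v1-base"  # repo id as written in this card
CONTEXT_LENGTH = 8192  # context length stated in this card


def max_prompt_tokens(max_new_tokens, context_length=CONTEXT_LENGTH):
    """Tokens left for the prompt once room for the completion is reserved."""
    if not 0 < max_new_tokens < context_length:
        raise ValueError("max_new_tokens must be between 1 and context_length - 1")
    return context_length - max_new_tokens


def complete(prompt, max_new_tokens=64):
    """Plain text completion; this is a base model, so no chat template is applied."""
    # Imported here so the helper above works without the heavy dependencies.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto"
    )
    # Keep the most recent text if the prompt exceeds the available budget.
    tokenizer.truncation_side = "left"
    inputs = tokenizer(
        prompt,
        return_tensors="pt",
        truncation=True,
        max_length=max_prompt_tokens(max_new_tokens),
    ).to(model.device)
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    # Drop the prompt tokens and decode only the newly generated ones.
    new_tokens = output_ids[0, inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)


# Example call (downloads the full model weights on first use):
# print(complete("Ibu kota Indonesia adalah"))
```

Because this is a continued pre-training checkpoint rather than an instruction-tuned one, prompts generally work best when phrased as text to be continued rather than as instructions.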