Llama3 8B CPT Sahabat-AI v1 Instruct Overview
This model is an 8-billion-parameter Llama3-based decoder model, co-initiated by GoTo Group and Indosat Ooredoo Hutchison, and developed by PT GoTo Gojek Tokopedia Tbk and AI Singapore. It has a context length of 8192 tokens and focuses on the Indonesian language and the regional languages Javanese and Sundanese, while also supporting English.
Key Capabilities
- Multilingual Proficiency: Fine-tuned with 448,000 Indonesian, 96,000 Javanese, 98,000 Sundanese, and 129,000 English instruction-completion pairs.
- Instruction Following: Evaluated on IFEval, demonstrating strong adherence to prompt constraints in Bahasa Indonesia.
- General Language Understanding: Benchmarked on SEA HELM (BHASA) and IndoMMLU for Indonesian, Javanese, and Sundanese, and on common English tasks from the Hugging Face Open LLM Leaderboard.
- Performance: Achieves competitive scores on Indonesian language benchmarks, with particularly strong results in Javanese and Sundanese relative to other models in its class.
Good for
- Applications requiring robust understanding and generation in Indonesian, Javanese, and Sundanese.
- Instruction-following tasks in these languages.
- Developers building AI services for the Indonesian market and its diverse linguistic landscape.
Limitations
Users should be aware that the model may hallucinate and occasionally generate irrelevant content. The model has not been safety-aligned; developers are advised to perform their own safety fine-tuning and add appropriate guardrails before deployment.