Overview
Gemma-SEA-LION-v4-27B is a 27-billion-parameter multilingual model developed by the AI Products Pillar at AI Singapore, based on the Gemma 3 architecture. It underwent continued pre-training on approximately 500 billion tokens sampled from a pool of over one trillion, covering 11 languages used across Southeast Asia: Burmese, English, Indonesian, Khmer, Lao, Malay, Mandarin, Tagalog, Tamil, Thai, and Vietnamese. The model inherits Gemma 3's substantial 128K-token context length and its native tokenizer.
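The snippet below is a minimal text-completion sketch using the Hugging Face transformers pipeline. The hub identifier aisingapore/Gemma-SEA-LION-v4-27B is an assumption inferred from the model name; verify it against the actual listing before use.

```python
# Minimal text-completion sketch via Hugging Face transformers.
import torch
from transformers import pipeline

pipe = pipeline(
    "text-generation",
    model="aisingapore/Gemma-SEA-LION-v4-27B",  # assumed hub id
    torch_dtype=torch.bfloat16,  # matches the training precision
    device_map="auto",
)

# Completion-style prompt in one of the supported languages (Indonesian).
out = pipe("Ibu kota Indonesia adalah", max_new_tokens=32)
print(out[0]["generated_text"])
```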
Key Capabilities
- Multilingual Proficiency: Enhanced performance across all 11 supported languages, owing to extensive continued pre-training.
- Large Context Window: Supports a 128K token context length, enabling processing of long documents and complex queries.
- Multimodal Understanding: Inherits Gemma 3's image and text understanding capabilities, including document comprehension, visual Q&A, and image-grounded reasoning; see the first sketch after this list.
- Function Calling: Designed for advanced function calling and structured outputs, facilitating integration into larger systems; see the second sketch after this list.
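For the multimodal capability, the sketch below follows the standard Gemma 3 interface in transformers, where a `<start_of_image>` token marks the image position in the prompt. The hub id and image URL are placeholders, not confirmed values.

```python
# Minimal image-understanding sketch using Gemma 3's standard interfaces.
import requests
import torch
from PIL import Image
from transformers import AutoProcessor, Gemma3ForConditionalGeneration

model_id = "aisingapore/Gemma-SEA-LION-v4-27B"  # assumed hub id
processor = AutoProcessor.from_pretrained(model_id)
model = Gemma3ForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

url = "https://example.com/invoice.png"  # placeholder image URL
image = Image.open(requests.get(url, stream=True).raw)

# "<start_of_image>" marks where the image embedding is inserted.
prompt = "<start_of_image> The total amount shown in this document is"
inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)

with torch.inference_mode():
    outputs = model.generate(**inputs, max_new_tokens=64)
print(processor.decode(outputs[0], skip_special_tokens=True))
```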
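For structured outputs, one prompt-level approach is to request JSON matching a small schema and parse the reply with the standard library. This sketch assumes an instruction-tuned variant (the "-IT" hub id is hypothetical) that follows formatting instructions; the schema and prompt wording are illustrative only.

```python
# Prompt-level structured-output sketch; parses the model's JSON reply.
import json
import torch
from transformers import pipeline

pipe = pipeline(
    "text-generation",
    model="aisingapore/Gemma-SEA-LION-v4-27B-IT",  # assumed IT variant
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

schema = {"city": "string", "country": "string"}  # illustrative schema
messages = [{
    "role": "user",
    "content": (
        "Answer with only a JSON object matching this schema: "
        f"{json.dumps(schema)}. Question: What is the capital of Vietnam?"
    ),
}]

reply = pipe(messages, max_new_tokens=64)[0]["generated_text"][-1]["content"]
result = json.loads(reply)  # raises ValueError if the reply is not valid JSON
print(result["city"], result["country"])
```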
Training Details
Continued pre-training used a diverse dataset of 500 billion tokens sampled from a one-trillion-token pool, comprising web data, code (StarCoder), open-source datasets such as FineWeb-Edu and the SEA-LION Pile (v1 and v2), and synthetically generated data for low-resource languages such as Khmer. Training ran in bfloat16 precision with the decoupled_adamw optimizer and a CosineAnnealing learning-rate scheduler.
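As a rough reconstruction of that optimization setup, the sketch below uses plain PyTorch stand-ins: torch.optim.AdamW, whose weight decay is decoupled and thus corresponds to the named decoupled_adamw, and CosineAnnealingLR for the CosineAnnealing schedule. Module size, learning rate, weight decay, and step count are placeholders, and the stand-in keeps default float32 where the real run used bfloat16.

```python
# Rough stand-in for the described training configuration.
import torch

model = torch.nn.Linear(512, 512)  # stand-in for the 27B LLM

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5, weight_decay=0.1)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=100)

for step in range(100):
    batch = torch.randn(8, 512)
    loss = model(batch).pow(2).mean()  # placeholder objective
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    scheduler.step()  # cosine-decays the learning rate each step
```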
Limitations
Users should be aware that the model has not been aligned for safety and may hallucinate or generate irrelevant content. Its vision capabilities remain comparable to those of Gemma 3 IT 27B, as continued pre-training focused exclusively on the text backbone.