AIDC-AI/Marco-LLM-SEA
AIDC-AI/Marco-LLM-SEA is a 7.6-billion-parameter Transformer-based language model fine-tuned for Southeast Asian languages, including Indonesian, Malay, Thai, and Vietnamese. Developed by AIDC-AI, it underwent extensive continued pretraining on approximately 56 billion tokens to strengthen regional language capabilities while remaining competitive on general benchmarks. The model features an improved tokenizer adapted to multiple Southeast Asian languages and scripts, making it well suited to applications that demand strong performance in these linguistic contexts.
Overview
Marco-LLM-SEA belongs to a series of language models developed by AIDC-AI, specifically designed and fine-tuned for Southeast Asian languages. This 7.6-billion-parameter model is part of a larger family ranging from 7B to 72B parameters, with both base and instruction-tuned variants.
Key Capabilities
- Multilingual Focus: Enhanced capabilities across Indonesian, Malay, Thai, Vietnamese, and other Southeast Asian languages.
- Continued Pretraining: Underwent extensive continued pretraining on approximately 56 billion tokens, improving its proficiency in target languages.
- Advanced Architecture: Based on the Transformer architecture, incorporating SwiGLU activation, attention QKV bias, and grouped-query attention (GQA).
- Adaptive Tokenizer: Utilizes an improved tokenizer specifically adapted for multiple Southeast Asian languages and scripts.
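To illustrate the grouped-query attention mentioned above: each key/value head is shared by a group of query heads, shrinking the KV cache relative to full multi-head attention. The following is a minimal NumPy sketch of the head-grouping idea only, not the model's actual implementation (head counts and dimensions below are illustrative):

```python
import numpy as np

def grouped_query_attention(q, k, v):
    # q: (n_q_heads, seq, d); k, v: (n_kv_heads, seq, d), n_kv_heads < n_q_heads
    n_q_heads, seq, d = q.shape
    group = n_q_heads // k.shape[0]
    # Each group of query heads attends using one shared key/value head.
    k = np.repeat(k, group, axis=0)   # (n_q_heads, seq, d)
    v = np.repeat(v, group, axis=0)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d)
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

rng = np.random.default_rng(0)
q = rng.normal(size=(8, 4, 16))   # 8 query heads
k = rng.normal(size=(2, 4, 16))   # only 2 KV heads
v = rng.normal(size=(2, 4, 16))
out = grouped_query_attention(q, k, v)
print(out.shape)  # (8, 4, 16)
```

With 8 query heads sharing 2 KV heads, the KV cache is a quarter the size of standard multi-head attention while the output shape is unchanged.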
Usage Recommendations
This base model is not intended for direct text generation without further adaptation. Developers are advised to apply post-training methods such as Supervised Fine-Tuning (SFT), Reinforcement Learning from Human Feedback (RLHF), or additional continued pretraining to tailor it to specific use cases. For more details, refer to the Hugging Face page and the associated research paper: Marco-LLM: Bridging Languages via Massive Multilingual Training for Cross-Lingual Enhancement.
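As a starting point for such post-training, the checkpoint can presumably be loaded with the standard Hugging Face transformers API. The repo id comes from this card; everything else (dtype choice, the helper name) is an illustrative assumption, not an official usage example:

```python
# Sketch: loading the base checkpoint for further adaptation (e.g. SFT).
# Assumes the standard transformers AutoModel API; downloading the 7.6B
# weights requires roughly 16 GB of disk and memory.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "AIDC-AI/Marco-LLM-SEA"

def load_base_model(model_id: str = MODEL_ID):
    """Return (tokenizer, model) ready for a fine-tuning loop."""
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto")
    return tokenizer, model
```

The returned model can then be passed to a training framework of your choice (e.g. the transformers `Trainer`) together with SFT data in the target languages.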