MaLA-LM/emma-500-llama3.1-8b-bi
EMMA-500 Llama 3.1 8B is an 8-billion-parameter multilingual language model from MaLA-LM, continually pre-trained from the Llama 3.1 8B base model. It covers 546 languages with substantial training data each, leveraging the MaLA Corpus, which includes bilingual translation data across more than 2,500 language pairs. The model targets multilingual tasks such as commonsense reasoning, machine translation, and text classification, with particular strength in low-resource languages.
EMMA-500 Llama 3.1 8B: Massively Multilingual Adaptation
EMMA-500 Llama 3.1 8B is an 8-billion-parameter language model from MaLA-LM, continually pre-trained from the Llama 3.1 8B base model. Its primary focus is improving language representation, especially for low-resource languages, by leveraging a diverse and extensive multilingual dataset.
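The model can be loaded through the standard Hugging Face `transformers` causal-LM interface. The snippet below is a minimal sketch: the dtype, device placement, and generation settings are illustrative choices, not official recommendations.

```python
# Minimal loading sketch for EMMA-500 Llama 3.1 8B (assumes a GPU with
# enough memory for an 8B model and the `accelerate` package for
# device_map="auto"; settings here are illustrative, not official).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "MaLA-LM/emma-500-llama3.1-8b-bi"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # halves memory vs. float32
    device_map="auto",           # spread layers over available devices
)

prompt = "Once upon a time"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```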
Key Capabilities & Features
- Massive Multilingual Support: Supports 546 languages with over 100k tokens of training data each, and includes bilingual translation data for over 2,500 language pairs.
- Continual Pre-training: Starting from Llama 3.1 8B, the model is continually pre-trained on the comprehensive MaLA Corpus.
- Diverse Data Mix: Trained on 671 billion tokens from a bilingual mix including code, books, instruction data, and academic papers.
- Task Performance: Designed to excel in multilingual tasks such as commonsense reasoning, machine translation, and text classification (see the translation sketch after this list).
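Because the model is continually pre-trained rather than instruction-tuned to a fixed template, translation is most naturally elicited with completion-style prompting. The few-shot format below is an assumed convention for illustration, reusing the `model` and `tokenizer` from the loading sketch above.

```python
# Illustrative few-shot translation prompt. The exact format is an
# assumption: EMMA-500 publishes no official translation template, so
# plain completion-style prompting is used here.
few_shot_prompt = (
    "English: Good morning.\n"
    "French: Bonjour.\n"
    "English: Where is the train station?\n"
    "French:"
)

inputs = tokenizer(few_shot_prompt, return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=30,
    do_sample=False,  # greedy decoding for a deterministic translation
)
# Decode only the newly generated tokens, then keep the first line so
# the model's continuation of further few-shot pairs is discarded.
completion = tokenizer.decode(
    outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
)
print(completion.strip().split("\n")[0])
```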
Use Cases & Considerations
- Ideal for: Massively multilingual NLP tasks, particularly machine translation and applications involving low-resource languages.
- Limitations: Performance may regress on some tasks, and on high-resource languages the model may trail monolingual models; it is not recommended for real-world, high-stakes scenarios.