MaLA-LM/emma-500-llama2-7b
EMMA-500 is a 7 billion parameter multilingual language model developed by MaLA-LM, built upon the Llama 2 architecture. It is continually pre-trained on the MaLA Corpus, which spans over 500 languages and more than 74 billion tokens. The model performs well on diverse multilingual tasks such as commonsense reasoning, machine translation, and text classification, with the largest gains in low-resource languages. EMMA-500 is designed for robust multilingual text generation and understanding across 546 languages.
EMMA-500: Massively Multilingual Llama 2 Adaptation
EMMA-500 is a 7 billion parameter language model from MaLA-LM that adapts the Llama 2 architecture to a massively multilingual setting. It is continually pre-trained on the MaLA Corpus, which contains over 74 billion tokens across 500+ languages. This continual pre-training improves language representation, especially for low-resource languages.
Key Capabilities
- Massively Multilingual: Covers 546 languages, each with more than 100k tokens of training data.
- Diverse Task Performance: Excels in a wide range of tasks including commonsense reasoning, machine translation, open-ended generation, and text classification.
- Enhanced Llama 2 Performance: Outperforms other Llama 2-based models in various multilingual settings and specialized tasks like natural language inference and code generation.
- Robustness: Maintains strong performance while improving multilingual capabilities.
Performance Highlights
In intrinsic evaluation, EMMA-500 achieves the lowest negative log-likelihood among the models compared. It notably surpasses other Llama 2 models in text classification and natural language inference, and shows improved performance in machine reading comprehension. Output diversity remains a challenge for very low-resource languages, where higher Self-BLEU scores indicate more repetitive generations, but overall multilingual proficiency is the model's key differentiator.
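To make the two metrics mentioned above concrete, here is a minimal sketch in Python. It is not the evaluation code used for EMMA-500: `mean_nll` assumes you already have the probability the model assigned to each reference token, and `self_overlap` is a simplified stand-in for Self-BLEU (a single n-gram order, no brevity penalty, whitespace tokenization). Both function names are hypothetical.

```python
import math
from collections import Counter


def mean_nll(token_probs):
    """Average negative log-likelihood per token, in nats (lower is better)."""
    return -sum(math.log(p) for p in token_probs) / len(token_probs)


def _ngrams(tokens, n):
    """Count the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def self_overlap(samples, n=2):
    """Simplified Self-BLEU proxy: mean clipped n-gram precision of each
    sample against all other samples. Higher values mean less diversity."""
    scores = []
    for i, hyp in enumerate(samples):
        hyp_counts = _ngrams(hyp.split(), n)
        total = sum(hyp_counts.values())
        if total == 0:
            continue  # sample too short to contain any n-gram
        # Largest count of each n-gram in any single other sample (BLEU-style clipping).
        ref_max = Counter()
        for j, ref in enumerate(samples):
            if j != i:
                for gram, count in _ngrams(ref.split(), n).items():
                    ref_max[gram] = max(ref_max[gram], count)
        clipped = sum(min(count, ref_max[gram]) for gram, count in hyp_counts.items())
        scores.append(clipped / total)
    return sum(scores) / len(scores) if scores else 0.0
```

A set of identical generations scores 1.0 under `self_overlap`, while generations that share no n-grams score 0.0, which is why a higher Self-BLEU is read as lower output diversity.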
Good For
- Applications requiring massively multilingual text generation.
- Tasks involving commonsense reasoning and machine translation across many languages.
- Text classification and natural language inference in diverse linguistic contexts.
- Research and development focusing on low-resource language processing.
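Usage sketch
The following is a minimal sketch of text generation with the model, assuming the Hugging Face `transformers` library and PyTorch are installed and that the checkpoint is published under the identifier shown in the title. The helper names `load_model` and `generate_text` and the default of 50 new tokens are illustrative choices, not part of the model's documentation.

```python
MODEL_ID = "MaLA-LM/emma-500-llama2-7b"


def load_model(model_id=MODEL_ID):
    """Download (on first use) and load the tokenizer and model weights."""
    # Imported lazily so that defining these helpers does not require transformers.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id)
    return tokenizer, model


def generate_text(prompt, tokenizer, model, max_new_tokens=50):
    """Continue `prompt` and return the decoded text."""
    inputs = tokenizer(prompt, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

Calling `tokenizer, model = load_model()` followed by `generate_text("Once upon a time", tokenizer, model)` returns the prompt plus a continuation. Loading a 7 billion parameter model needs substantial memory, so in practice you may want to pass options such as a reduced-precision dtype to `from_pretrained`.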