EMMA-500: Massively Multilingual Llama 2 Adaptation
EMMA-500 is a 7-billion-parameter language model from MaLA-LM, built on the Llama 2 architecture and designed for massively multilingual adaptation. It is continually pre-trained on the MaLA Corpus, which spans over 74 billion tokens in 500+ languages, substantially improving language coverage, especially for low-resource languages.
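The MaLA-LM release does not restate its exact data-mixing recipe here, but a common way continual pre-training improves low-resource coverage is temperature-based language sampling: raw per-language token shares are smoothed so that low-resource languages are sampled more often than their corpus share. A minimal sketch (the function name and the 0.3 temperature are illustrative assumptions, not taken from the EMMA-500 paper):

```python
def sampling_weights(token_counts, temperature=0.3):
    """Temperature-smoothed sampling distribution over languages.

    p_lang is proportional to (n_lang / N) ** temperature, so a
    temperature below 1 upweights low-resource languages relative
    to their raw token share in the corpus.
    """
    total = sum(token_counts.values())
    smoothed = {lang: (n / total) ** temperature
                for lang, n in token_counts.items()}
    z = sum(smoothed.values())
    return {lang: w / z for lang, w in smoothed.items()}


# Hypothetical token counts for three languages of very different sizes.
counts = {"eng": 40_000_000_000, "fin": 2_000_000_000, "quy": 100_000}
weights = sampling_weights(counts, temperature=0.3)
```

With these toy counts, the low-resource language (`quy`) receives a far larger sampling weight than its raw token share, while the high-resource language is downweighted; the distribution still sums to 1.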
Key Capabilities
- Massively Multilingual: Covers 546 languages, each with substantial training data (over 100k tokens).
- Diverse Task Performance: Excels in a wide range of tasks including commonsense reasoning, machine translation, open-ended generation, and text classification.
- Enhanced Llama 2 Performance: Outperforms other Llama 2-based models in various multilingual settings and specialized tasks like natural language inference and code generation.
- Robustness: Gains broad multilingual coverage without degrading performance on established benchmarks.
Performance Highlights
In intrinsic evaluation, EMMA-500 achieves the lowest negative log-likelihood among the compared Llama 2-based models, and it shows marked gains in key downstream areas: it surpasses other Llama 2 models in text classification and natural language inference, and improves on machine reading comprehension. Output diversity for very low-resource languages remains a challenge (reflected in higher Self-BLEU scores, which indicate more repetitive generations), but its overall multilingual proficiency is a key differentiator.
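Self-BLEU, the diversity metric referenced above, scores each generated sample with BLEU against the other samples as references and averages the results; higher values mean the outputs repeat each other more. A minimal pure-Python sketch (bigram-level BLEU with clipped precision and a brevity penalty; real evaluations typically use a full BLEU implementation such as sacrebleu):

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def bleu(candidate, references, max_n=2):
    """Sentence-level BLEU: clipped n-gram precision + brevity penalty."""
    precisions = []
    for n in range(1, max_n + 1):
        cand_counts = Counter(ngrams(candidate, n))
        if not cand_counts:
            return 0.0
        # Clip each candidate n-gram count by its max count in any reference.
        max_ref = Counter()
        for ref in references:
            for g, c in Counter(ngrams(ref, n)).items():
                max_ref[g] = max(max_ref[g], c)
        clipped = sum(min(c, max_ref[g]) for g, c in cand_counts.items())
        if clipped == 0:
            return 0.0
        precisions.append(clipped / sum(cand_counts.values()))
    # Brevity penalty against the closest reference length.
    c = len(candidate)
    r = min((len(ref) for ref in references), key=lambda L: (abs(L - c), L))
    bp = 1.0 if c > r else math.exp(1 - r / c)
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / len(precisions))
    return bp * geo_mean


def self_bleu(samples, max_n=2):
    """Average BLEU of each sample against all other samples as references."""
    scores = [bleu(s, samples[:i] + samples[i + 1:], max_n)
              for i, s in enumerate(samples)]
    return sum(scores) / len(scores)
```

Identical generations score 1.0 and fully disjoint ones score 0.0, so a model whose low-resource outputs collapse into repeated phrasings shows an elevated Self-BLEU.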
Good For
- Applications requiring massively multilingual text generation.
- Tasks involving commonsense reasoning and machine translation across many languages.
- Text classification and natural language inference in diverse linguistic contexts.
- Research and development focusing on low-resource language processing.