EMMA-500 Llama 3.1 8B Mono: Massively Multilingual Adaptation
EMMA-500 Llama 3.1 8B Mono is a multilingual language model obtained by continually pre-training Llama 3.1 8B. Developed by MaLA-LM, it is designed to improve language coverage, particularly for low-resource languages, by training on the extensive MaLA Corpus.
Key Capabilities
- Broad Language Support: Supports 546 languages, each with over 100k tokens of training data.
- Diverse Training Data: Trained on a monolingual data mix spanning domains such as code, books, instruction data, and academic papers.
- Multilingual NLP Tasks: Excels in tasks such as commonsense reasoning, machine translation, and text classification across numerous languages.
- Continual Pre-training: Extends an already strong base model to hundreds of new languages via continual pre-training rather than training from scratch.
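A minimal loading sketch using Hugging Face transformers. The repository id below is an assumption based on the project's naming; check the MaLA-LM organization on the Hub for the exact id.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "MaLA-LM/emma-500-llama3.1-8b-mono"  # assumed repo id

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # 8B weights fit in roughly 16 GB at bf16
    device_map="auto",           # place layers on available GPU(s)/CPU
)

# Plain completion: the model continues text in the input language.
prompt = "Bonjour, comment allez-vous ?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```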
Good For
- Massively Multilingual Applications: Ideal for scenarios requiring broad language coverage, especially for low-resource languages.
- Research and Development: Useful for exploring multilingual NLP and language adaptation techniques.
- Machine Translation: A strong candidate for translation tasks thanks to its broad multilingual training; see the prompting sketch after this list.
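A hedged few-shot translation sketch, reusing the `tokenizer` and `model` from the loading example above. Since this is a base (not chat-tuned) model, a few in-context examples steer it toward translation; the language pair and examples here are purely illustrative.

```python
# Few-shot prompt: two English-Finnish pairs, then a query left incomplete.
few_shot = (
    "English: Good morning.\nFinnish: Hyvää huomenta.\n\n"
    "English: Thank you very much.\nFinnish: Kiitos paljon.\n\n"
    "English: Where is the train station?\nFinnish:"
)
inputs = tokenizer(few_shot, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=30, do_sample=False)

# Decode only the newly generated tokens, skipping the prompt.
new_tokens = outputs[0][inputs["input_ids"].shape[1]:]
print(tokenizer.decode(new_tokens, skip_special_tokens=True))
```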
This model is part of the EMMA-500 series, which focuses on adaptation with monolingual data, and was continually trained on a total of 419 billion tokens. For more details, refer to the project website and the associated research paper.