MaLA-LM/emma-500-llama3.1-8b-mono
MaLA-LM/emma-500-llama3.1-8b-mono is an 8-billion-parameter multilingual language model from MaLA-LM, produced by continually pre-training Llama 3.1 8B. It covers 546 languages with substantial training data, drawing on the MaLA Corpus, which spans books, code, instruction data, and academic papers. The model performs well on massively multilingual NLP tasks such as commonsense reasoning, machine translation, and text classification, making it suitable for applications requiring broad language coverage.
EMMA-500 Llama 3.1 8B Mono: Massively Multilingual Adaptation
EMMA-500 Llama 3.1 8B Mono is a continually pre-trained multilingual language model built upon the Llama 3.1 8B architecture. Developed by MaLA-LM, this model is specifically designed to enhance language representation, particularly for low-resource languages, by leveraging the extensive MaLA Corpus.
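As a sketch of basic usage, the snippet below loads the model with the Hugging Face transformers library, assuming the standard causal-LM interface; the dtype and device settings are illustrative defaults rather than requirements of this model.

```python
# Minimal loading sketch using Hugging Face transformers.
# Assumes the checkpoint exposes the standard causal-LM interface;
# device_map="auto" additionally requires the accelerate package.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "MaLA-LM/emma-500-llama3.1-8b-mono"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",   # use the checkpoint's native precision
    device_map="auto",    # place weights on available devices
)
```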
Key Capabilities
- Broad Language Support: Supports 546 languages, each with over 100k tokens of training data.
- Diverse Training Data: Utilizes a rich monolingual mix from domains including code, books, instruction data, and academic papers.
- Multilingual NLP Tasks: Excels in tasks such as commonsense reasoning, machine translation, and text classification across numerous languages.
- Continual Pre-training: Adapts the base Llama 3.1 model to hundreds of additional languages by continuing pre-training on the MaLA Corpus, rather than training a new model from scratch.
Good For
- Massively Multilingual Applications: Ideal for scenarios requiring broad language coverage, especially for low-resource languages.
- Research and Development: Useful for exploring multilingual NLP and language adaptation techniques.
- Machine Translation: A strong candidate for translation tasks thanks to its extensive multilingual training (see the sketch after this list).
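As an illustration of the machine-translation use case, the hedged sketch below sends a few-shot translation prompt through the transformers text-generation pipeline. The prompt layout and the English-Finnish language pair are illustrative assumptions; since this is a continually pre-trained base model rather than an instruction-tuned chat model, few-shot prompting is the safer approach.

```python
# Hedged few-shot translation sketch using the transformers pipeline.
# The prompt format and language pair are illustrative assumptions.
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="MaLA-LM/emma-500-llama3.1-8b-mono",
    device_map="auto",
)

prompt = (
    "English: Good morning.\n"
    "Finnish: Hyvää huomenta.\n"
    "English: Thank you very much.\n"
    "Finnish:"
)

result = generator(prompt, max_new_tokens=20, do_sample=False, return_full_text=False)
print(result[0]["generated_text"].strip())
```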
This model is part of the EMMA-500 series, focusing on monolingual data adaptation, and is trained on a total of 419 billion tokens. For more details, refer to the project website and the associated research paper.