AfriqueLlama-8B: Multilingual Model for African Languages
AfriqueLlama-8B, developed by McGill-NLP, is an 8-billion-parameter causal language model built on Meta's Llama 3.1 8B architecture. It is a key component of the AfriqueLLM suite, which adapts open language models for improved performance across 20 African languages through continued pre-training (CPT).
Key Capabilities and Features
- Multilingual Adaptation: Specifically adapted for 20 African languages (e.g., Swahili, Hausa, Yoruba, Amharic) while retaining proficiency in high-resource languages like English, French, Portuguese, and Arabic.
- Extensive Continued Pre-training: Underwent CPT on 29.6 billion tokens of curated multilingual data, including African monolingual data, code (CornStack-Python), mathematics (FineMath-4+), and GPT-4.1-translated synthetic data.
- Balanced Data Distribution: Utilizes UniMax sampling to balance data, capping high-resource languages and upsampling lower-resource ones for effective learning.
- Performance Improvement: Delivers a +14.7-point (42.2% relative) improvement over its base Llama 3.1 8B model on a suite of multilingual benchmarks, with the largest gains on benchmarks such as FLORES and INJONG.
- Context Length: Features a native context length of 8,192 tokens, extendable with RoPE scaling.
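The UniMax-style balancing described above can be sketched as a budget-allocation routine: each language receives an equal share of the total token budget, but no language is sampled for more than a fixed number of passes (epochs) over its corpus, and any unused share is redistributed to the remaining languages. The corpus sizes and the epoch cap below are illustrative placeholders, not the values used for AfriqueLlama.

```python
def unimax_budgets(corpus_sizes, total_budget, epoch_cap=4.0):
    """UniMax-style token-budget allocation (sketch).

    Splits `total_budget` uniformly across languages while capping each
    language at `epoch_cap` passes over its corpus; capped languages
    release their unused share to the larger remaining corpora.
    """
    # Process languages from smallest corpus to largest so redistribution
    # only ever flows toward languages that can still absorb tokens.
    langs = sorted(corpus_sizes, key=corpus_sizes.get)
    budgets = {}
    remaining = total_budget
    for i, lang in enumerate(langs):
        share = remaining / (len(langs) - i)  # uniform split of what is left
        budgets[lang] = min(share, epoch_cap * corpus_sizes[lang])
        remaining -= budgets[lang]
    return budgets

# Illustrative corpora (token counts): a 10-token low-resource corpus is
# capped at 4 epochs (40 tokens); the leftover is split among the rest.
budgets = unimax_budgets({"sw": 100, "ha": 10, "en": 1000}, total_budget=300)
```

With these toy numbers the low-resource corpus is upsampled to its cap (40 tokens, 4 epochs) while the high-resource corpus is held to 130 tokens (0.13 epochs), which is the capping-plus-upsampling behavior the bullet above describes.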
Good For
- Applications requiring strong performance in African languages: Ideal for text generation, translation, and language understanding in low-resource African language contexts.
- Multilingual systems: Suitable for developers building applications that need to operate across a diverse set of languages, including both African and major global languages.
- Research and development: Provides a robust base model for further fine-tuning or research into African language NLP.
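The context extension mentioned in the features above relies on RoPE scaling. One common variant is linear position interpolation, where positions are divided by a scale factor so that a longer sequence maps onto the angle range the model saw during training; the model card does not specify which scaling scheme AfriqueLlama uses, so the sketch below is illustrative rather than a description of its exact configuration.

```python
import numpy as np

def rope_angles(positions, dim=8, base=10000.0, scale=1.0):
    """Rotary position embedding angles (sketch).

    `scale` > 1 implements linear RoPE scaling: positions are compressed
    by the scale factor, so a model trained with an 8,192-token context
    can address longer sequences within its familiar angle range.
    """
    # Standard RoPE inverse frequencies for each rotating pair of dims.
    inv_freq = base ** (-np.arange(0, dim, 2) / dim)
    # One row of angles per (scaled) position.
    return np.outer(np.asarray(positions, dtype=float) / scale, inv_freq)

# With scale=2, position 16,384 yields the same angles that position
# 8,192 produced during training, keeping attention in-distribution.
extended = rope_angles([16384], scale=2.0)
native = rope_angles([8192], scale=1.0)
```

The trade-off of linear interpolation is a coarser effective position resolution, which is why extended-context models are often lightly fine-tuned after the scale factor is applied.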