# AfriqueGemma-4B: Multilingual LLM for African Languages
AfriqueGemma-4B is a 4.3 billion parameter causal language model developed by McGill-NLP, building upon Google's Gemma 3 4B PT. It is a core component of the AfriqueLLM suite, specifically designed to enhance language understanding and generation across 20 African languages through extensive continued pre-training (CPT) on 25.2 billion tokens.
## Key Capabilities
- Multilingual Proficiency: Adapted for 20 African languages (e.g., Swahili, Hausa, Yoruba, Amharic) while mitigating catastrophic forgetting in high-resource languages like English, French, Portuguese, and Arabic.
- Diverse Training Data: Trained on a corpus combining African monolingual data (FineWeb2, WURA, MADLAD-400), 1 billion tokens of code (CornStack-Python), 1 billion tokens of mathematics (FineMath-4+), and 324 million tokens of GPT-4.1 translated synthetic data.
- Improved Performance: Demonstrates significant gains over its base model (Gemma 3 4B) on multilingual benchmarks including AfriMGSM, AfriMMLU, and FLORES, with an overall improvement of 22.2% on the AfriqueLLM evaluation suite.
- Context Length: Supports a native context length of 8,192 tokens; continued pre-training was conducted at a sequence length of 16,384 tokens.
## Good For
- Applications requiring strong performance in low-resource African languages.
- Multilingual tasks where balancing performance across diverse linguistic contexts is crucial.
- Research and development in African NLP, leveraging its specialized training data and evaluation.
For deployment, the model can be served with vLLM or SGLang to expose an OpenAI-compatible API endpoint.
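As a minimal serving sketch, the commands below assume the model is published under the Hugging Face ID `McGill-NLP/AfriqueGemma-4B` (an assumption based on the developer and model names above; check the model card for the exact identifier):

```shell
# Sketch: serve with vLLM (model ID McGill-NLP/AfriqueGemma-4B is assumed).
vllm serve McGill-NLP/AfriqueGemma-4B \
  --max-model-len 8192 \
  --port 8000

# Alternatively, serve with SGLang:
# python -m sglang.launch_server --model-path McGill-NLP/AfriqueGemma-4B --port 8000

# Query the OpenAI-compatible endpoint. Since this is a continued-pretraining
# base model (not instruction-tuned), use the completions route with a raw prompt:
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "McGill-NLP/AfriqueGemma-4B", "prompt": "Habari ya asubuhi", "max_tokens": 64}'
```

Capping `--max-model-len` at 8,192 matches the native context length stated above; longer limits may be possible but are untested here.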