Atlas-Chat-9B: Moroccan Darija Language Model

Atlas-Chat-9B is a 9 billion parameter instruction-tuned model developed by MBZUAI France Lab, specifically designed for Moroccan Darija, the colloquial Arabic of Morocco. It is built upon the Gemma 2 architecture and is part of a family of models (2B, 9B, 27B) aimed at making advanced AI accessible for Darija speakers.

Key Capabilities

Darija Language Generation: Excels in generating fluent and contextually rich Moroccan Darija text.
Multifaceted Applications: Designed for question answering, summarization, and translation tasks in informal Darija.
Resource-Efficient Deployment: Its compact size allows deployment on devices like laptops, desktops, or personal cloud setups.
Cultural Relevance: Supports conversational agents, content generation, and cultural research related to Morocco and its language.

Performance Highlights

Atlas-Chat-9B demonstrates strong performance across various Darija-specific benchmarks, outperforming several other models in its size class. For instance, it achieves 58.23 on DarijaMMLU, 43.65 on DarijaHellaSwag, and 95.62 on DarijaAlpacaEval. In standard NLP tasks, it shows significant improvements in translation (e.g., 28.08 BLEU on DODa-10k) and sentiment analysis (81.89%).

Training Details

The model was trained on approximately 450,000 instructions from the Darija-SFT-Mixture dataset, which includes synthetic instructions tailored to Moroccan culture, samples from publicly available Moroccan Arabic datasets, and translated English/multilingual instruction-tuning datasets. Training was conducted using 8 Nvidia A100 80 GB GPUs with FSDP on AWS Sagemaker, employing parameter-efficient fine-tuning with LoRA.

Overview

Atlas-Chat-9B: Moroccan Darija Language Model

Key Capabilities

Performance Highlights

Training Details

Full Model Card (README)