MoR-M1-Qwen2.5-0.6a-0.4F: Moroccan Darija Causal Language Model
Developed by Oit Lab (OIT Technologies), MoR-M1-Qwen2.5-0.6a-0.4F is a 387-million-parameter causal language model from the MoR-M1 Series, focused on advancing low-resource NLP for Moroccan Arabic (Darija). Built on the Qwen2.5 transformer architecture, it features a custom 32K BPE tokenizer trained exclusively on Darija text, yielding significantly better token efficiency and compression than standard Arabic tokenizers.
Key Capabilities & Features
- Darija-Optimized Tokenization: Custom BPE tokenizer achieves ~130% relative gain in characters per token compared to mBERT, resulting in ~56% shorter sequences.
- Efficient Long Context: Improved token efficiency allows for better utilization of its 32,768-token context window.
- Compact & Specialized: A 387M-parameter model designed for research and deployment under compute constraints, offering full dialect specialization.
- Stable Training: Trained on 60.78 MB of Darija text over 5 epochs, reaching a final loss of ~0.63, i.e. a perplexity of exp(0.63) ≈ 1.88.
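The efficiency and training figures above are internally consistent, which can be checked with a few lines of arithmetic (the 130% gain and 0.63 loss are taken from this card, not re-measured):

```python
import math

# A ~130% relative gain in characters per token vs. mBERT means the
# Darija tokenizer packs 2.3x as many characters into each token.
ratio = 1.0 + 1.30  # 2.3x characters per token

# The same text therefore needs 1/2.3 of the tokens: ~56% fewer.
sequence_reduction = 1.0 - 1.0 / ratio
print(f"sequences are ~{sequence_reduction:.1%} shorter")  # ~56.5% shorter

# Perplexity is the exponential of the cross-entropy loss.
perplexity = math.exp(0.63)
print(f"perplexity ≈ {perplexity:.2f}")  # ≈ 1.88
```

This is why the ~56% sequence-length reduction and the ~1.88 perplexity both follow directly from the headline numbers rather than being independent measurements.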
Use Cases & Limitations
This model is well suited to applications requiring deep linguistic understanding and generation in Moroccan Darija, particularly research and fine-tuning scenarios where compute is limited. Its high token efficiency also makes it a good fit for tasks that benefit from long context windows. Note, however, that the model is optimized for Darija rather than Modern Standard Arabic, and it has undergone neither instruction tuning nor safety alignment. Because its training corpus was scraped from the web and may carry biases, it is not intended for high-risk or safety-critical applications.
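A minimal generation sketch with the Hugging Face `transformers` library is shown below. The repository id used here is a placeholder inferred from the model name, not a confirmed hub path, and the sampling parameters are illustrative defaults:

```python
def generate_darija(prompt: str,
                    model_id: str = "oit-lab/MoR-M1-Qwen2.5-0.6a-0.4F",
                    max_new_tokens: int = 64) -> str:
    """Sample a Darija continuation from the model.

    NOTE: `model_id` is a placeholder guess; substitute the actual
    published repository path. Imports are deferred so the file can be
    inspected without `transformers` installed.
    """
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id)

    inputs = tokenizer(prompt, return_tensors="pt")
    # This is a base model with no instruction tuning: treat the output
    # as a raw text continuation, not an answer to a question.
    outputs = model.generate(**inputs,
                             max_new_tokens=max_new_tokens,
                             do_sample=True,
                             temperature=0.8,
                             top_p=0.95)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)


if __name__ == "__main__":
    print(generate_darija("واش كاين شي طريقة باش"))
```

Deferring the heavy imports into the function keeps the snippet importable in environments where the model weights are unavailable; for serious use, move loading out of the function so the model is instantiated once.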