QCRI/Fanar-1-9B-Instruct
Overview
Fanar-1-9B-Instruct: A Bilingual Arabic-English LLM
Fanar-1-9B-Instruct is an instruction-tuned large language model developed by the Qatar Computing Research Institute (QCRI) at Hamad Bin Khalifa University (HBKU). It is based on the google/gemma-2-9b architecture and has been continually pretrained on 1 trillion tokens, with a balanced focus on Arabic (410B tokens) and English (515B tokens), plus 102B code tokens. A key differentiator is its specific attention to the richness of the Arabic language, supporting Modern Standard Arabic and various dialects including Gulf, Levantine, and Egyptian. The model's training data and instruction-tuning phases (4.5M SFT instructions, 250K DPO preference pairs) ensure alignment with Islamic values and Arab cultures.
Key Capabilities
- Bilingual Proficiency: Excels in both Arabic and English language understanding and generation.
- Arabic Dialect Support: Handles Modern Standard Arabic and a diverse set of Arabic dialects.
- Cultural Alignment: Designed to align with Islamic values and Arab cultures through meticulous data curation.
- Instruction-Tuned: Optimized for conversational and instruction-following tasks.
Good For
- Developing conversational agents that operate in Arabic, English, or bilingual contexts.
- Cultural and dialectal question answering specific to the Arab world.
- Educational, governmental, and civic NLP applications targeting Arabic-speaking audiences.
- Research in Arabic natural language generation and understanding.