Overview
CUAIStudents/Qwen-Ar-GEC is a specialized 7.6 billion parameter language model built upon the Qwen 2.5-7B-Instruct architecture. Developed by CUAIStudents, its primary function is Arabic Grammatical Error Correction (GEC), including spelling and stylistic error detection. The model was fine-tuned using the QLoRA method on 50,000 samples from the CUAIStudents/Arabic-Tashkeel dataset, with a focus on adding full diacritization (tashkeel) to Arabic text.
Key Capabilities
- Arabic GEC: Automatically identifies and corrects grammatical, spelling, and stylistic errors in Arabic text.
- Full Diacritization: Ensures complete vocalization (tashkeel) of all characters, even if the input text is grammatically correct.
- Non-altering Correction: Designed to correct errors without changing words, names, numbers, sentence structure, or the original meaning of the text.
- Qwen 2.5-7B-Instruct Base: Leverages the robust capabilities of the Qwen 2.5-7B-Instruct model as its foundation.
Good For
- Automated Proofreading: Ideal for applications requiring automatic correction of Arabic documents.
- Language Learning: Supports learners by providing accurate grammatical feedback and full diacritization.
- Academic Writing: Assists in producing high-quality, error-free Arabic academic content.
- Text Normalization: Useful for tasks that require standardized and fully diacritized Arabic text.
Limitations
The model's training dataset has limited coverage of Modern Standard Arabic (MSA), and it was trained on only 50,000 samples due to hardware constraints, despite a larger dataset being available.