Malaysian-Qwen2.5-7B-Instruct Overview
This model is a 7.6-billion-parameter instruction-tuned language model developed by mesolitica on top of Qwen2.5-7B-Instruct. It was fine-tuned on a 1.5-billion-token Malaysian instruction dataset to strengthen its understanding and generation of Malaysian-specific content.
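As with other Qwen2.5-family checkpoints, the model loads through the standard transformers API. Below is a minimal sketch; the repo id is inferred from the model name above and should be verified against the actual Hugging Face Hub listing.

```python
# Minimal loading sketch. The repo id below is an assumption inferred from the
# model name; check the actual Hugging Face Hub listing before use.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "mesolitica/Malaysian-Qwen2.5-7B-Instruct"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",   # keep the checkpoint's native precision
    device_map="auto",    # requires `accelerate`; places weights on available GPUs
)
```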
Key Capabilities
- Multilingual and Dialectal Support: The model supports responses and code generation across the languages, scripts, and dialects used in Malaysia, including Mandarin, Tamil, Jawi script, Manglish, and the Malay dialects of Johor, Kedah, Kelantan, Pahang, Perak, Sabah, Sarawak, Selangor, Negeri Sembilan, and Terengganu (a prompting sketch follows this list).
- Malaysian Context Understanding: It is specifically trained to handle multi-turn conversations and queries related to Malaysian legislation, politics, religions, and local languages.
- Code Generation: It can handle coding requests written in any of the languages and dialects listed above.
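As a rough illustration of the dialect support above, the sketch below reuses the tokenizer and model from the loading example and sends a question written in Kelantan dialect through the standard Qwen2.5 chat template. The prompt is an invented example, not taken from the training data.

```python
# Illustrative dialect prompt; the Kelantanese sentence is a made-up example.
import torch

messages = [
    {"role": "user", "content": "Demo oyak sikit pasal makanan Kelate hok sedap."},
]

# Qwen2.5 chat template; returns a [1, seq_len] tensor of input ids
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

with torch.no_grad():
    output = model.generate(inputs, max_new_tokens=256, do_sample=True, temperature=0.7)

# Decode only the newly generated tokens
print(tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True))
```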
Training Details
The model was fine-tuned with LoRA on the mesolitica/Malaysian-SFT dataset. Training used multipacking at an 8192-token context length with proper SDPA causal masking, which prevents attention contamination across packed documents and keeps position IDs correct within each document, together with Chunk CCE loss for the LoRA run.
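A conceptual sketch of that multipacking setup, assuming pre-tokenized documents; the helper below is illustrative, not mesolitica's actual training code. Position IDs restart at each document boundary, and a block-diagonal boolean mask (the form SDPA accepts) blocks attention across documents:

```python
# Conceptual multipacking sketch; function and variable names are invented.
import torch

def pack_documents(docs: list[list[int]], max_len: int = 8192):
    """Concatenate documents into one sequence, resetting position IDs per
    document and building a block-diagonal causal mask so tokens never
    attend across document boundaries."""
    input_ids, position_ids, doc_ids = [], [], []
    for doc_idx, doc in enumerate(docs):
        take = doc[: max_len - len(input_ids)]
        input_ids += take
        position_ids += list(range(len(take)))   # positions restart per document
        doc_ids += [doc_idx] * len(take)
        if len(input_ids) >= max_len:
            break

    ids = torch.tensor(doc_ids)
    causal = torch.tril(torch.ones(len(ids), len(ids), dtype=torch.bool))
    same_doc = ids.unsqueeze(0) == ids.unsqueeze(1)   # True inside each document block
    attn_mask = causal & same_doc                     # boolean mask usable with SDPA
    return torch.tensor(input_ids), torch.tensor(position_ids), attn_mask

# Two short "documents" packed into one sequence:
ids, pos, mask = pack_documents([[11, 12, 13], [21, 22]], max_len=8)
print(pos.tolist())  # [0, 1, 2, 0, 1] -- position IDs reset at the boundary
```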
Performance
On the MalayMMLU benchmark (0-shot, first-token accuracy), Malaysian-Qwen2.5-7B-Instruct (revision 83a0e145c726385502898ab7e016982eae1b684d) achieved an average accuracy of 69.26%, compared with 66.52% for the original Qwen2.5-7B-Instruct.
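First-token accuracy scores a multiple-choice question by checking whether the most probable next token after the prompt matches the correct answer option. A minimal sketch of that metric, with invented names and placeholder option labels (the real prompts come from the MalayMMLU dataset):

```python
# Sketch of 0-shot first-token scoring for multiple-choice evaluation.
# `model` and `tokenizer` are assumed loaded as in the earlier example.
import torch

def first_token_prediction(model, tokenizer, prompt: str, choices: list[str]) -> str:
    """Return the choice whose first token has the highest logit after the prompt."""
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        logits = model(**inputs).logits[0, -1]   # next-token distribution
    # Compare only the first token id of each option label, e.g. "A".."D";
    # note that leading-space tokenization can matter and must match the prompt format.
    choice_ids = [tokenizer.encode(c, add_special_tokens=False)[0] for c in choices]
    return choices[int(torch.argmax(logits[choice_ids]))]

# pred = first_token_prediction(model, tokenizer, prompt_text, ["A", "B", "C", "D"])
```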