mesolitica/Malaysian-Qwen2.5-72B-Instruct

TEXT GENERATION · Concurrency Cost: 4 · Model Size: 72.7B · Quant: FP8 · Ctx Length: 32k · Architecture: Transformer

The Malaysian-Qwen2.5-72B-Instruct model by mesolitica is a 72.7 billion parameter instruction-tuned language model, fine-tuned from Qwen2.5-72B-Instruct on a 1.5 billion token Malaysian instruction dataset. It is optimized to understand and respond in the languages, scripts, and dialects used in Malaysia, including Mandarin, Tamil, and Jawi, across diverse local contexts. The model excels at multi-turn conversations about Malaysian legislation, politics, religions, and local languages, and it improves on the MalayMMLU benchmark relative to its base model.

Malaysian-Qwen2.5-72B-Instruct Overview

This model is a 72.7 billion parameter instruction-tuned language model developed by mesolitica, building upon the Qwen2.5-72B-Instruct architecture. It has been extensively fine-tuned on a highly curated 1.5 billion token Malaysian instruction dataset to specialize in Malaysian linguistic and cultural contexts.
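As a reference point, the snippet below is a minimal sketch of running the model through the Hugging Face transformers chat interface; it assumes sufficient GPU memory for a 72.7B model (or an equivalently quantized deployment), and the Malay prompt is purely illustrative.

```python
# Minimal sketch: load the model with Hugging Face transformers and run a
# single-turn Malay prompt. The dtype and sampling settings are assumptions,
# not values from the model card.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "mesolitica/Malaysian-Qwen2.5-72B-Instruct"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # assumption: bf16 weights; an FP8 deployment would differ
    device_map="auto",           # shard across available GPUs
)

messages = [
    {"role": "user", "content": "Terangkan secara ringkas apa itu Akta Kerja 1955."}
]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=256, do_sample=True, temperature=0.7)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```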

Key Capabilities and Improvements

  • Multilingual Malaysian Support: Enhanced ability to respond in the languages, scripts, and dialects used in Malaysia, including Mandarin, Tamil, Jawi script, Manglish, and the state dialects of Johor, Kedah, Kelantan, Pahang, Perak, Sabah, Sarawak, Selangor, Negeri Sembilan, and Terengganu.
  • Contextual Understanding: Improved comprehension and generation in multi-turn conversations on Malaysian topics such as legislation, politics, religions, and local languages (see the multi-turn sketch after this list).
  • Performance Benchmarks: Improved accuracy on the MalayMMLU benchmark, averaging 79.63% under next-token probability scoring and 77.29% under first-token match, surpassing the original Qwen2.5-72B-Instruct in both settings.
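
To make the multi-turn behaviour above concrete, here is a hedged sketch of how a conversation history would be rendered with the tokenizer's chat template; the Malay questions and the earlier assistant reply are invented placeholders, and generation would then proceed as in the earlier loading sketch.

```python
# Sketch: rendering a two-turn Malay conversation with the chat template.
# The assistant turn below is a made-up placeholder; in practice it would be
# the model's own previous output.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("mesolitica/Malaysian-Qwen2.5-72B-Instruct")

messages = [
    {"role": "user", "content": "Apakah fungsi Dewan Rakyat?"},
    {"role": "assistant", "content": "Dewan Rakyat ialah dewan rendah Parlimen Malaysia ..."},
    {"role": "user", "content": "Bagaimana ahli-ahlinya dipilih?"},
]

# Render the full history into a single prompt string, ready for generation.
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
print(prompt)
```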

Training Details

The model was fine-tuned with LoRA on the q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj, embed_tokens, and lm_head layers, using a rank of 128 and an alpha of 256 (2.0× the rank). Training used multipacking with an 8192-token context length and proper SDPA causal masking so that packed documents do not attend to one another, together with a chunked Cut Cross-Entropy (CCE) loss adapted for LoRA. The training dataset was mesolitica/Malaysian-SFT.
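
For orientation, the sketch below shows what an equivalent adapter configuration could look like in the peft library, using the target layers, rank, and alpha listed above; the dropout value and task type are assumptions not stated in the card, and treating embed_tokens and lm_head as LoRA targets (rather than fully trained modules) is likewise an assumption.

```python
# Sketch of a peft LoraConfig matching the layers and rank/alpha described above.
# lora_dropout and task_type are assumptions; they are not given in the card.
from peft import LoraConfig

lora_config = LoraConfig(
    r=128,            # LoRA rank from the training details
    lora_alpha=256,   # alpha = 2.0 x rank
    lora_dropout=0.0, # assumption: not stated in the card
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
        "embed_tokens", "lm_head",
    ],
    task_type="CAUSAL_LM",
)
```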