Overview
mesolitica/Malaysian-Qwen2.5-7B-Dialect-Reasoning-GRPO is a 7.6-billion-parameter Qwen 2.5 model developed by mesolitica. It is fine-tuned with online reinforcement learning using GRPO (Group Relative Policy Optimization) on a highly curated Malay dialect reasoning dataset. During training, each datapoint was replicated so that 6 generations were sampled per prompt, strengthening reasoning within and across dialects.
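The group of 6 generations per datapoint is what makes GRPO's reward signal work: each generation's reward is normalized against the mean and standard deviation of its group. A minimal sketch of that group-relative advantage, with purely illustrative reward values (the actual reward function and training loop are not shown here):

```python
# Sketch of GRPO's group-relative advantage: rewards for a group of
# generations sampled from the same prompt are normalized against the
# group's own mean and standard deviation. Reward values are illustrative.
import statistics

def group_advantages(rewards: list[float]) -> list[float]:
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against a zero-variance group
    return [(r - mean) / std for r in rewards]

# One reward per generation in a group of 6 (as in this model's training setup).
rewards = [0.2, 0.9, 0.5, 0.5, 0.1, 0.8]
advantages = group_advantages(rewards)
```

Generations that beat their group's average get a positive advantage and are reinforced; below-average ones are penalized, without needing a separate value network.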
Key Capabilities
- Dialect Reasoning: Significantly improves reasoning capabilities within and across various Malay dialects.
- Dialect Translation: Demonstrates proficiency in translating between specific Malay dialects (e.g., Johor, Kedah, Kelantan) and standard Malay.
- Reinforcement Learning: Leverages online GRPO with full parameter updates for enhanced performance.
Performance
The model was evaluated with vLLM using sacrebleu chrF max@5 scores. In Float32 precision it averaged 56.82% for dialect-to-standard-Malay translation and 58.11% for standard-Malay-to-dialect translation. Float16 results were similar, averaging 57.27% and 57.44% respectively.
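The max@5 protocol means each source sentence gets 5 candidate translations, the best-scoring candidate against the reference is kept, and the best scores are averaged over the corpus. A hedged sketch of that aggregation, with a toy character-bigram F-score standing in for sacrebleu's chrF (a real evaluation would call sacrebleu itself):

```python
# Toy character-bigram F-score standing in for chrF (sacrebleu's chrF
# uses character n-grams up to order 6; this is a simplification).
def char_bigrams(text: str) -> set[str]:
    return {text[i:i + 2] for i in range(len(text) - 1)}

def toy_chrf(candidate: str, reference: str) -> float:
    cand, ref = char_bigrams(candidate), char_bigrams(reference)
    if not cand or not ref:
        return 0.0
    overlap = len(cand & ref)
    precision, recall = overlap / len(cand), overlap / len(ref)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def max_at_k(candidates: list[str], reference: str) -> float:
    # max@k: keep only the best-scoring of the k candidates.
    return max(toy_chrf(c, reference) for c in candidates)

def corpus_score(samples: list[tuple[list[str], str]]) -> float:
    # Average the per-sentence max@k scores over the corpus.
    return sum(max_at_k(cands, ref) for cands, ref in samples) / len(samples)
```

Note that max@k is an optimistic metric: it credits the model if any one of its 5 samples is a good translation.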
Recommended Usage
For optimal reasoning performance, use the following system prompt: "You are going to enter reasoning mode. First, you try to think step-by-step in Malay. After that, put your final answer within $\boxed{}$." This instructs the model to reason step by step in Malay before giving its final answer inside \boxed{}.
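Assembling that system prompt into a chat request can be sketched as below. The message format follows the common chat convention used by Hugging Face chat templates and OpenAI-style APIs; the actual inference call (e.g. via transformers or vLLM) is assumed, not shown:

```python
# Build a chat-style message list with the recommended reasoning prompt.
# The prompt text is taken verbatim from the model card.
SYSTEM_PROMPT = (
    "You are going to enter reasoning mode. First, you try to think "
    "step-by-step in Malay. After that, put your final answer within "
    "$\\boxed{}$."
)

def build_messages(user_text: str) -> list[dict]:
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": user_text},
    ]

# Example: ask for a Kelantan-dialect to standard-Malay translation.
messages = build_messages(
    "Terjemahkan ke bahasa Melayu standard: Demo gi mano?"
)
```

This list can then be passed to `tokenizer.apply_chat_template(...)` or a chat-completions endpoint as usual.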