# M-Thinker-1.5B-Iter2: Multilingual Reasoning Model
M-Thinker-1.5B-Iter2 is a 1.5-billion-parameter model developed by Xue Zhang et al., engineered to overcome common limitations of Large Reasoning Models (LRMs) in non-English languages. Traditional LRMs often fail to keep the input, the thought process, and the answer in the same language, and they tend to reason less accurately in non-English languages than in English.
## Key Innovations
This model introduces a novel training approach using the GRPO algorithm with two distinct reward mechanisms (sketched after this list):
- Language Consistency (LC) Reward: Enforces strict language consistency between the input, the model's thought process, and the final answer.
- Cross-lingual Thinking Alignment (CTA) Reward: Transfers the model's English reasoning capabilities to non-English languages by comparing non-English reasoning paths with their English counterparts.
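To make the two rewards concrete, here is a minimal Python sketch. It is an illustration, not the authors' implementation: `langdetect` stands in for whatever language-identification step the training pipeline actually uses, and embedding cosine similarity (via `sentence-transformers`) stands in for the paper's actual CTA scoring. The function names, thresholds, and encoder choice are all assumptions.

```python
from langdetect import detect
from sentence_transformers import SentenceTransformer, util

# Multilingual sentence encoder, used here only as a stand-in similarity scorer.
_encoder = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

def lc_reward(input_text: str, thought: str, answer: str) -> float:
    """Language Consistency: full reward only when the input, the thought
    process, and the answer are all detected as the same language."""
    target = detect(input_text)
    return 1.0 if detect(thought) == target and detect(answer) == target else 0.0

def cta_reward(non_english_thought: str, english_thought: str) -> float:
    """Cross-lingual Thinking Alignment: score a non-English reasoning path
    by its embedding similarity to an English reasoning path for the same
    problem (a proxy for the paper's comparison mechanism)."""
    embeddings = _encoder.encode(
        [non_english_thought, english_thought], convert_to_tensor=True
    )
    return util.cos_sim(embeddings[0], embeddings[1]).item()
```

In a GRPO setup, signals like these would be combined with a correctness reward when scoring each sampled completion in a group; the exact weighting is specific to the paper.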
## Performance and Capabilities
Through an iterative Reinforcement Learning (RL) procedure, M-Thinker-1.5B-Iter2 achieves nearly 100% language consistency and demonstrates superior performance on multilingual benchmarks such as MMATH and PolyMath. It also exhibits strong generalization to out-of-domain languages, making it a robust solution for global deployment of reasoning-focused AI applications.
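A minimal inference sketch with Hugging Face `transformers` is shown below. The repository id is a placeholder (this card does not state the released checkpoint name), and the snippet assumes the checkpoint ships with a chat template.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "M-Thinker-1.5B-Iter2"  # placeholder: replace with the released checkpoint id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

# A non-English (Spanish) math question; the model should think and answer in Spanish.
messages = [{"role": "user", "content": "¿Cuál es la suma de los primeros 100 enteros positivos?"}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output_ids = model.generate(input_ids, max_new_tokens=1024)
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```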
## Use Cases
- Multilingual Reasoning Tasks: Ideal for applications requiring complex problem-solving and logical deduction in various languages.
- Cross-lingual AI Systems: Enhances the reliability and accuracy of AI systems operating in diverse linguistic environments.
- Improved User Experience: Provides a more consistent and accurate reasoning experience for non-English speakers.