# M-Thinker-1.5B-Iter2: Multilingual Reasoning Model
M-Thinker-1.5B-Iter2 is a 1.5-billion-parameter model developed by Xue Zhang et al., engineered to overcome common limitations of Large Reasoning Models (LRMs) in non-English languages. Traditional LRMs often fail to keep the input, the thought process, and the answer in the same language, and they tend to reason less accurately in non-English languages than in English.
## Key Innovations
This model introduces a novel training approach using the GRPO algorithm with two distinct reward mechanisms (sketched after this list):
- Language Consistency (LC) Reward: Enforces strict language consistency between the input, the model's thought process, and the final answer.
- Cross-lingual Thinking Alignment (CTA) Reward: Transfers the model's English reasoning capabilities to non-English languages by comparing non-English reasoning paths with their English counterparts.
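To make the two rewards concrete, here is a minimal Python sketch. It is an illustration, not the authors' implementation: `langdetect` stands in for whatever language-identification step the training pipeline actually uses, and embedding cosine similarity (via `sentence-transformers`) stands in for the paper's actual CTA scoring. The function names, thresholds, and encoder choice are all assumptions.

```python
from langdetect import detect
from sentence_transformers import SentenceTransformer, util

# Multilingual sentence encoder, used here only as a stand-in similarity scorer.
_encoder = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

def lc_reward(input_text: str, thought: str, answer: str) -> float:
    """Language Consistency: full reward only when the input, the thought
    process, and the answer are all detected as the same language."""
    target = detect(input_text)
    return 1.0 if detect(thought) == target and detect(answer) == target else 0.0

def cta_reward(non_english_thought: str, english_thought: str) -> float:
    """Cross-lingual Thinking Alignment: score a non-English reasoning path
    by its embedding similarity to an English reasoning path for the same
    problem (a proxy for the paper's comparison mechanism)."""
    embeddings = _encoder.encode(
        [non_english_thought, english_thought], convert_to_tensor=True
    )
    return util.cos_sim(embeddings[0], embeddings[1]).item()
```

In a GRPO setup, signals like these would be combined with a correctness reward when scoring each sampled completion in a group; the exact weighting is specific to the paper.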
## Performance and Capabilities
Through an iterative Reinforcement Learning (RL) procedure, M-Thinker-1.5B-Iter2 achieves nearly 100% language consistency and demonstrates superior performance on multilingual benchmarks such as MMATH and PolyMath. It also exhibits strong generalization to out-of-domain languages, making it a robust solution for global deployment of reasoning-focused AI applications.
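A minimal inference sketch with Hugging Face `transformers` is shown below. The repository id is a placeholder (this card does not state the released checkpoint name), and the snippet assumes the checkpoint ships with a chat template.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "M-Thinker-1.5B-Iter2"  # placeholder: replace with the released checkpoint id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

# A non-English (Spanish) math question; the model should think and answer in Spanish.
messages = [{"role": "user", "content": "¿Cuál es la suma de los primeros 100 enteros positivos?"}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output_ids = model.generate(input_ids, max_new_tokens=1024)
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```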
## Use Cases
- Multilingual Reasoning Tasks: Ideal for applications requiring complex problem-solving and logical deduction in various languages.
- Cross-lingual AI Systems: Enhances the reliability and accuracy of AI systems operating in diverse linguistic environments.
- Improved User Experience: Provides a more consistent and accurate reasoning experience for non-English speakers.