M-Thinker-7B-Iter2: Multilingual Reasoning with Enhanced Consistency
M-Thinker-7B-Iter2 is a 7.6-billion-parameter Large Reasoning Model (LRM) developed by Xue Zhang et al., designed specifically to overcome limitations in non-English reasoning tasks. Traditional LRMs often fail to keep the input and output in the same language and exhibit lower accuracy in non-English contexts. M-Thinker addresses both issues through a novel training approach.
Key Capabilities and Innovations
- Consistency-Enhanced Reinforcement Learning (RL): The model is trained with the GRPO algorithm, which integrates two specialized reward mechanisms:
  - Language Consistency (LC) Reward: Enforces strict language consistency between the input, the model's thought process, and the final answer.
  - Cross-lingual Thinking Alignment (CTA) Reward: Compares non-English reasoning paths with English reasoning paths to transfer and enhance reasoning capabilities across languages.
- Superior Multilingual Performance: Achieves nearly 100% language consistency and demonstrates superior performance on multilingual benchmarks such as MMATH and PolyMath.
- Generalization: Exhibits excellent generalization capabilities on out-of-domain languages, making it robust for diverse linguistic applications.
- Backbone: Built upon the DeepSeek-R1-Distill-Qwen-7B model.
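The two reward signals above can be illustrated with a deliberately simplified sketch. This is not the paper's exact formulation: the function names, the Jaccard overlap used for alignment, and the reward weights are all illustrative assumptions; the real LC and CTA rewards are defined in the M-Thinker paper and operate on model rollouts inside GRPO.

```python
def lc_reward(prompt_lang: str, thought_lang: str, answer_lang: str) -> float:
    """Language Consistency (toy version): full reward only when the thought
    process and the final answer are both in the prompt's language."""
    return 1.0 if prompt_lang == thought_lang == answer_lang else 0.0


def cta_reward(non_english_steps: list[str], english_steps: list[str]) -> float:
    """Cross-lingual Thinking Alignment (toy version): Jaccard overlap between
    a non-English reasoning path and a reference English one, standing in for
    the paper's alignment score."""
    a, b = set(non_english_steps), set(english_steps)
    return len(a & b) / len(a | b) if a | b else 0.0


def combined_reward(correct: bool,
                    prompt_lang: str, thought_lang: str, answer_lang: str,
                    non_english_steps: list[str], english_steps: list[str],
                    w_lc: float = 0.5, w_cta: float = 0.5) -> float:
    """Hypothetical shaped reward for a single rollout: a correctness term
    plus weighted LC and CTA terms (weights are illustrative, not the paper's)."""
    base = 1.0 if correct else 0.0
    base += w_lc * lc_reward(prompt_lang, thought_lang, answer_lang)
    base += w_cta * cta_reward(non_english_steps, english_steps)
    return base
```

In GRPO, a scalar reward like this would be computed per sampled rollout and used to form group-relative advantages; the sketch only shows how consistency and alignment signals could be folded into that scalar.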
Ideal Use Cases
- Multilingual Reasoning Applications: Particularly effective for complex reasoning tasks in non-English languages, where maintaining language consistency and achieving high accuracy are both critical.
- Global AI Deployment: Suitable for scenarios requiring robust LRM performance for non-English speakers, enhancing user experience and accessibility.