XueZhang-bjtu/M-Thinker-1.5B-Iter1
M-Thinker-1.5B-Iter1 by XueZhang-bjtu is a 1.5-billion-parameter large reasoning model (LRM) with a 32,768-token context length, designed to enhance multilingual reasoning. It addresses limitations in language consistency and reasoning accuracy for non-English languages through a consistency-enhanced reinforcement-learning approach. The model maintains input-output language consistency while improving reasoning performance across languages, making it well suited to complex multilingual reasoning tasks.
M-Thinker-1.5B-Iter1: Multilingual Reasoning with Enhanced Consistency
M-Thinker-1.5B-Iter1, developed by XueZhang-bjtu, is a 1.5 billion parameter Large Reasoning Model (LRM) that focuses on overcoming common limitations of LRMs in non-English languages. Traditional LRMs often struggle with maintaining language consistency across input, thought, and answer, and exhibit lower reasoning accuracy in non-English contexts compared to English.
Key Capabilities and Innovations
- Enhanced Multilingual Reasoning: M-Thinker is trained with the GRPO (Group Relative Policy Optimization) algorithm, incorporating a novel Language Consistency (LC) reward and a Cross-lingual Thinking Alignment (CTA) reward.
- Near-Perfect Language Consistency: The LC reward enforces strict language consistency between input, thought, and answer, aiming for nearly 100% consistency.
- Cross-lingual Reasoning Transfer: The CTA reward facilitates the transfer of the model's English reasoning capabilities to non-English languages by comparing reasoning paths.
- Superior Performance: Through an iterative reinforcement learning procedure, M-Thinker models demonstrate improved performance on multilingual benchmarks such as MMATH and PolyMath.
- Generalization to Out-of-Domain Languages: The model exhibits strong generalization capabilities to languages not explicitly seen during training.
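To make the reward design above concrete, here is a simplified Python sketch of a language-consistency reward. The Unicode-script detector, the binary LC reward, the composite reward, and the weight `alpha` are all illustrative stand-ins, not the formulation actually used to train M-Thinker:

```python
# Simplified stand-in for the LC reward described above. The real reward
# likely uses a proper language identifier; this sketch approximates
# "language" by the dominant Unicode script of the text.

def detect_script(text: str) -> str:
    """Crude language proxy: classify text by its dominant character range."""
    counts = {"latin": 0, "cjk": 0, "cyrillic": 0, "other": 0}
    for ch in text:
        code = ord(ch)
        if ch.isascii() and ch.isalpha():
            counts["latin"] += 1
        elif 0x4E00 <= code <= 0x9FFF or 0x3040 <= code <= 0x30FF:
            counts["cjk"] += 1  # CJK ideographs plus Japanese kana
        elif 0x0400 <= code <= 0x04FF:
            counts["cyrillic"] += 1
        elif ch.isalpha():
            counts["other"] += 1  # accented Latin, Arabic, Devanagari, ...
    return max(counts, key=counts.get)

def lc_reward(question: str, thought: str, answer: str) -> float:
    """Binary reward: 1.0 only when input, thought, and answer all match."""
    target = detect_script(question)
    if detect_script(thought) == target and detect_script(answer) == target:
        return 1.0
    return 0.0

def total_reward(is_correct: bool, question: str, thought: str,
                 answer: str, alpha: float = 0.5) -> float:
    """Hypothetical composite reward: accuracy plus weighted consistency.
    The weight alpha is an assumption, not a value from the paper."""
    return float(is_correct) + alpha * lc_reward(question, thought, answer)
```

In a GRPO-style setup, such a scalar reward would be computed per sampled completion and normalized within each group of rollouts; the point of the sketch is only that a response reasoning in the wrong language earns no consistency credit even when the final answer is correct.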
Ideal Use Cases
- Multilingual Reasoning Tasks: Particularly effective for complex reasoning problems requiring consistent language use across different stages of thought.
- Applications Requiring High Language Consistency: Suitable for scenarios where maintaining the input language throughout the reasoning process is critical.
- Global Deployment of LRMs: Designed to improve the user experience for non-English speakers by enhancing reasoning accuracy and consistency.
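For deployment, a minimal inference sketch might look like the following. It assumes the model follows the standard Hugging Face transformers chat interface with a chat template; check the repository files for the authors' actual recommended usage and sampling settings:

```python
# Illustrative inference sketch; generation parameters are assumptions,
# not the authors' recommended settings.

MODEL_ID = "XueZhang-bjtu/M-Thinker-1.5B-Iter1"

def build_messages(question: str) -> list[dict]:
    """Wrap a user question in the chat-message format."""
    return [{"role": "user", "content": question}]

def generate(question: str, max_new_tokens: int = 2048) -> str:
    # Imported lazily so the sketch can be read without the dependencies.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID, torch_dtype="auto", device_map="auto"
    )
    prompt = tokenizer.apply_chat_template(
        build_messages(question), tokenize=False, add_generation_prompt=True
    )
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output = model.generate(**inputs, max_new_tokens=max_new_tokens)
    # Decode only the newly generated tokens, skipping the prompt.
    return tokenizer.decode(
        output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
    )

# generate("二加二等于几？请给出推理过程。")
# If the LC training holds up, both the thought and the answer
# should come back in Chinese.
```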