## Model Overview
XueZhang-bjtu/1.5B-cold-start-SFT is a 1.5-billion-parameter supervised fine-tuned (SFT) model that serves as the initial backbone of the M-Thinker project. It is built on the DeepSeek-R1-Distill-Qwen-1.5B architecture and fine-tuned on the M-Thinker-SFT-data. The model is a key component in developing Large Reasoning Models (LRMs) that aim to overcome limitations in processing non-English languages, particularly input-output language consistency and reasoning accuracy.
## Key Characteristics
- Foundation Model: Acts as a 'cold-start' SFT model, providing a strong base for subsequent reinforcement learning (RL) stages, such as those employing the GRPO algorithm with Language Consistency (LC) and Cross-lingual Thinking Alignment (CTA) rewards.
- Multilingual Reasoning Focus: Although this specific model is an SFT base, its development targets multilingual reasoning, addressing input-output language mismatch and the accuracy degradation LRMs often show in non-English contexts.
- Training Data: Fine-tuned with the M-Thinker-SFT-data, preparing it for more advanced multilingual reasoning tasks.
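The Language Consistency (LC) reward mentioned above can be illustrated with a toy sketch: score a response by the fraction of its alphabetic characters written in the expected script. This is an illustrative stand-in, not the actual M-Thinker reward; the helper names `script_of` and `language_consistency_reward` are our own.

```python
import unicodedata


def script_of(ch):
    """Coarse script label for a character, derived from its Unicode name."""
    if not ch.isalpha():
        return None
    name = unicodedata.name(ch, "")
    if name.startswith("CJK"):
        return "han"
    if "HIRAGANA" in name or "KATAKANA" in name:
        return "kana"
    if "HANGUL" in name:
        return "hangul"
    if "CYRILLIC" in name:
        return "cyrillic"
    if "ARABIC" in name:
        return "arabic"
    if "LATIN" in name:
        return "latin"
    return "other"


def language_consistency_reward(text, target_script):
    """Return the fraction of letters in `text` matching `target_script`.

    Value lies in [0, 1]; 1.0 means the response is written entirely in
    the expected script (a toy proxy for an LC reward signal).
    """
    letters = [c for c in text if c.isalpha()]
    if not letters:
        return 0.0
    matches = sum(1 for c in letters if script_of(c) == target_script)
    return matches / len(letters)
```

A real LC reward would also need to handle the model's reasoning trace separately from its final answer, but the core idea of penalizing off-language tokens is the same.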
## Intended Use Cases
- Base for Multilingual LRM Development: Ideal for researchers and developers seeking a foundation on which to build and experiment with advanced multilingual reasoning capabilities.
- Exploration of RL for Language Consistency: Suitable for those interested in applying reinforcement learning techniques to enhance language consistency and cross-lingual reasoning alignment in LLMs.
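As a minimal sketch of the RL side, GRPO-style training scores a group of sampled responses per prompt and normalizes each reward against the group's statistics to obtain advantages. The function below (name `grpo_advantages` is our own; M-Thinker's actual training code may differ in details) shows that group-relative normalization:

```python
from statistics import mean, pstdev


def grpo_advantages(rewards, eps=1e-8):
    """Group-relative advantages for one prompt's sampled responses.

    Each reward r_i (e.g. a combination of accuracy and language-
    consistency rewards) is normalized against the group:
        A_i = (r_i - mean(r)) / (std(r) + eps)
    so responses are compared only to their own group, with no learned
    value function.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]
```

In practice the per-response reward fed into this normalization would combine several signals (task correctness plus LC/CTA terms), but the advantage computation itself stays this simple.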