# Light-R1-32B: Advanced Math Reasoning Model

Light-R1-32B, developed by Qihoo360, is a 32.8-billion-parameter model fine-tuned from Qwen2.5-32B-Instruct. It is specifically designed to excel at complex mathematical reasoning tasks, particularly those requiring long Chain-of-Thought (CoT) capabilities, even though its base model was not initially trained on long-CoT data.
## Key Capabilities & Differentiators
- Superior Math Performance: Achieves a score of 76.6 on AIME24 and 64.6 on AIME25, surpassing DeepSeek-R1-Distill-Qwen-32B and other models in its class.
- Cost-Efficient Training: Developed using a curriculum of SFT (Supervised Fine-Tuning) followed by DPO (Direct Preference Optimization), with an estimated training cost of approximately $1,000 (6 hours on 12 H800 machines).
- Transparent & Reproducible: All training datasets (SFT and DPO) and the training code, based on 360-LLaMA-Factory, are open-sourced, providing a validated recipe for training strong long-CoT models.
- Data Decontamination: Training data is rigorously decontaminated against common reasoning benchmarks such as AIME24/25 and MATH-500 to ensure robust, unbiased evaluation.
- Forced Thinking Mechanism: Incorporates a `<think>` token in its chat template to explicitly prompt the model for reasoning steps, enhancing its problem-solving process.
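Benchmark decontamination of the kind described above is commonly done with word-level n-gram overlap against the benchmark problems. The sketch below is illustrative only; the project's actual matching rules and n-gram size are assumptions here, not taken from the released code:

```python
def ngrams(text: str, n: int = 8) -> set:
    """Return the set of word-level n-grams in a text (assumed n=8)."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def is_contaminated(sample: str, benchmark_problems: list, n: int = 8) -> bool:
    """Flag a training sample that shares any n-gram with a benchmark problem."""
    sample_grams = ngrams(sample, n)
    return any(sample_grams & ngrams(problem, n) for problem in benchmark_problems)
```

A sample quoting a benchmark problem verbatim would be dropped from the training set, while unrelated problems pass through; real pipelines typically add normalization (punctuation stripping, number masking) on top of this.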
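The forced-thinking mechanism can be illustrated with a minimal prompt-building sketch. The template below is a hypothetical stand-in, not the model's published chat template (which ships with the released tokenizer); it only shows the idea of ending the prompt with an open `<think>` tag so generation begins inside the reasoning block:

```python
def build_prompt(question: str) -> str:
    """Assemble a prompt that forces the model to start with reasoning.

    Hypothetical template for illustration: the authoritative template
    comes from the checkpoint's tokenizer (e.g. via apply_chat_template).
    """
    return (
        "<|im_start|>user\n"
        f"{question}<|im_end|>\n"
        "<|im_start|>assistant\n"
        "<think>\n"  # generation starts inside the thinking block
    )


prompt = build_prompt("What is the sum of the first 100 positive integers?")
```

Because the assistant turn already opens with `<think>`, the model's first generated tokens are reasoning steps rather than a final answer.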
## Ideal Use Cases
- Advanced Mathematical Problem Solving: Excels at competition-style math challenges and complex quantitative analysis.
- Research & Development: Provides a transparent, cost-effective baseline for developing and experimenting with long-CoT models.
- Educational Tools: Can be integrated into systems requiring high-accuracy mathematical reasoning and step-by-step solutions.