Overview
II-Thought-1.5B-Preview is a 1.5-billion-parameter language model from Intelligent-Internet, post-trained with reinforcement learning (RL). It builds on the deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B base model and was trained with the GRPO algorithm inside the ii_thought framework.
Key Capabilities & Training
This preview model was trained on a 50K-example math subset of the large-scale, multi-task II-Thought-RL-v0 dataset. Training uses a reward model that scores both answer correctness and format compliance, with the goal of improving the quality and structure of generated responses. The model supports a maximum context length of 32,768 tokens.
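The combined reward described above can be sketched as follows. This is a minimal illustration, not the published implementation: the function names, the 0.5 format weight, and the exact-string answer comparison are all assumptions (a real pipeline would likely use symbolic or numeric equivalence checking for math answers).

```python
import re

BOXED_RE = re.compile(r"\\boxed\{([^{}]+)\}")

def format_reward(response: str) -> float:
    # 1.0 if the response wraps a final answer in \boxed{...}, else 0.0.
    return 1.0 if BOXED_RE.search(response) else 0.0

def answer_reward(response: str, gold: str) -> float:
    # Extract the boxed answer and compare to the reference.
    # Plain string match here; real graders typically normalize expressions.
    m = BOXED_RE.search(response)
    return 1.0 if m and m.group(1).strip() == gold.strip() else 0.0

def total_reward(response: str, gold: str,
                 w_answer: float = 1.0, w_format: float = 0.5) -> float:
    # Weighted combination of answer and format rewards.
    # The weights are illustrative, not the model card's actual values.
    return w_answer * answer_reward(response, gold) + w_format * format_reward(response)
```

A response that is both well-formatted and correct scores highest, a well-formatted but wrong answer still earns the format component, and an unformatted answer earns nothing, which pushes the policy toward the expected output structure.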
Performance Highlights
II-Thought-1.5B-Preview shows significant gains on mathematical-reasoning benchmarks over both its base model and Qwen2.5-Math-1.5B-Instruct. It averages 49.90% across math and reasoning tasks, including AMC23 (79.77%), AIME24 (34.17%), Olympiad Bench (52.78%), and Math500 (87.2%), and also improves on LiveCodeBench (19.84%) and IFEval (44.84%).
Usage Guidelines
For optimal performance, set sampling parameters to temperature = 0.6 and top_p = 0.95. When tackling mathematical problems, explicitly request step-by-step reasoning and ask for the final answer to be placed within \boxed{}.
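The guidelines above can be captured in a small helper. This is a sketch, not official usage code: the `build_math_prompt` helper and the `max_tokens` value are illustrative, and the commented-out vLLM lines assume the model is hosted under the Hugging Face ID `Intelligent-Internet/II-Thought-1.5B-Preview`, which should be verified against the actual repository.

```python
def build_math_prompt(problem: str) -> str:
    # Hypothetical helper: append the recommended step-by-step and \boxed{}
    # instructions to a raw math problem.
    return (
        f"{problem}\n\n"
        "Please reason step by step, and put your final answer within \\boxed{}."
    )

# Recommended sampling parameters from the model card. max_tokens is an
# illustrative budget well within the 32,768-token context window.
SAMPLING_PARAMS = {"temperature": 0.6, "top_p": 0.95, "max_tokens": 8192}

# Example inference with vLLM (commented out; requires the vllm package
# and downloading the model weights):
#
# from vllm import LLM, SamplingParams
# llm = LLM(model="Intelligent-Internet/II-Thought-1.5B-Preview")
# params = SamplingParams(**SAMPLING_PARAMS)
# outputs = llm.generate([build_math_prompt("What is 7 * 8?")], params)
```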