S1-VL-32B-RL: Scientific Multimodal Reasoning Model

S1-VL-32B-RL, developed by the ScienceOne AI team at the Chinese Academy of Sciences, is a 33.4-billion-parameter multimodal large language model optimized for scientific applications. It combines two reasoning paradigms: Scientific Reasoning, for complex, multi-step problem analysis, and Thinking with Images, in which the model invokes code tools to manipulate images (cropping, zooming, enhancement, annotation) during inference.

Key Capabilities

Scientific Multimodal Reasoning: Achieves state-of-the-art performance across diverse scientific benchmarks including MMMU, MathVision, and VRSBench-MINI, covering mathematics, physics, chemistry, astronomy, earth sciences, and biology.
Image Operation Reasoning: Ranks first across five benchmarks (HRBench-4K, HRBench-8K, MME-RealWorld-CN, MME-RealWorld-Lite, V*), demonstrating superior ability in high-resolution image understanding and real-world visual reasoning.
Code Tool Invocation: Can proactively use code to enhance visual information, as demonstrated in case studies involving medical imaging where it crops and magnifies regions of interest for clearer analysis.
Progressive Post-training: Uses a four-stage training procedure (Scientific Reasoning SFT, Thinking-with-Images Cold-Start SFT, and two stages of Reinforcement Learning (RL) with the SAPO algorithm) to progressively unlock and optimize its scientific reasoning and image operation capabilities.
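
The crop-and-magnify step mentioned under Code Tool Invocation can be pictured with a small sketch. This is not the model's actual tool code, which the card does not show. It is a minimal illustration, assuming a grayscale image stored as nested Python lists and nearest-neighbour magnification; the names `crop`, `zoom`, and `crop_and_zoom` are hypothetical.

```python
from typing import List, Tuple

# Assumption for illustration: a grayscale image is a list of rows of pixel intensities.
Image = List[List[int]]


def crop(image: Image, left: int, top: int, right: int, bottom: int) -> Image:
    """Return the region covering rows top..bottom-1 and columns left..right-1."""
    height, width = len(image), len(image[0])
    if not (0 <= left < right <= width and 0 <= top < bottom <= height):
        raise ValueError("crop box lies outside the image")
    return [row[left:right] for row in image[top:bottom]]


def zoom(image: Image, factor: int) -> Image:
    """Magnify by an integer factor by repeating each pixel (nearest neighbour)."""
    if factor < 1:
        raise ValueError("factor must be a positive integer")
    zoomed: Image = []
    for row in image:
        # Repeat every pixel horizontally, then repeat the widened row vertically.
        wide_row = [pixel for pixel in row for _ in range(factor)]
        zoomed.extend(list(wide_row) for _ in range(factor))
    return zoomed


def crop_and_zoom(image: Image, box: Tuple[int, int, int, int], factor: int) -> Image:
    """Crop a region of interest, then magnify it for closer inspection."""
    left, top, right, bottom = box
    return zoom(crop(image, left, top, right, bottom), factor)


if __name__ == "__main__":
    image = [
        [0, 1, 2, 3],
        [4, 5, 6, 7],
        [8, 9, 10, 11],
        [12, 13, 14, 15],
    ]
    # Magnify the central 2x2 region by a factor of 2.
    for row in crop_and_zoom(image, (1, 1, 3, 3), 2):
        print(row)
```

In practice such a tool would more likely rely on an imaging library and a smoother interpolation method; the nested-list version is used here only to keep the sketch self-contained.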

Good For

Analyzing and solving complex scientific problems involving both text and images.
Interpreting dense scientific charts and high-resolution remote-sensing, microscopic, and astronomical images.
Applications requiring dynamic image manipulation during the reasoning process, such as medical image analysis and academic figure Q&A.
