ScienceOne-AI/S1-VL-32B
S1-VL-32B is a 33.4-billion-parameter multimodal large language model developed by the ScienceOne AI team at the Chinese Academy of Sciences and designed specifically for scientific domains. It supports two distinct reasoning paradigms: Scientific Reasoning, for complex, multi-step problem-solving, and Thinking with Images, in which the model invokes code tools for image operations during reasoning. The model excels at interpreting dense scientific charts, high-resolution imagery, and microscopic images, and reports state-of-the-art performance on multiple scientific multimodal evaluation benchmarks.
S1-VL-32B: Scientific Multimodal Reasoning Model
S1-VL-32B, developed by the ScienceOne AI team at the Chinese Academy of Sciences, is a 33.4-billion-parameter multimodal large language model optimized for scientific domains. It is built around two core reasoning paradigms, Scientific Reasoning and Thinking with Images, supported by a dedicated data pipeline and a staged post-training procedure.
Key Capabilities
- Scientific Reasoning: Utilizes chain-of-thought processes for analyzing and solving intricate, multi-step scientific problems across disciplines like mathematics, physics, chemistry, astronomy, earth sciences, and biology.
- Thinking with Images: Enables the model to invoke code tools for image operations (e.g., cropping, zooming, enhancement, annotation) during its reasoning process. This is particularly effective for dense scientific charts and for high-resolution remote sensing, microscopic, and astronomical images.
- Cross-disciplinary Data Pipeline: Employs a robust data processing pipeline to ensure high-quality visual reasoning trajectories for training.
- Four-stage Progressive Post-training: Training proceeds in four stages: Scientific Reasoning SFT, Thinking-with-Images Cold-Start SFT, and two stages of reinforcement learning (RL) with the SAPO algorithm. Together these stages progressively unlock and refine the model's scientific reasoning and image manipulation abilities.
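To make the Thinking with Images capability more concrete, the sketch below shows the kind of crop, zoom, and enhancement steps such a code tool might perform. It is an illustration only, not the model's actual tool interface: the function names (`crop`, `zoom`, `stretch_contrast`) and the toy list-of-lists image format are assumptions chosen so the example runs with the Python standard library alone.

```python
from typing import List, Tuple

# Hypothetical image format for this sketch: grayscale pixels (0-255), row-major.
Image = List[List[int]]


def crop(img: Image, box: Tuple[int, int, int, int]) -> Image:
    """Return the region (left, top, right, bottom); right and bottom are exclusive."""
    left, top, right, bottom = box
    return [row[left:right] for row in img[top:bottom]]


def zoom(img: Image, factor: int) -> Image:
    """Upscale by an integer factor using nearest-neighbor replication."""
    out: Image = []
    for row in img:
        wide = [px for px in row for _ in range(factor)]
        out.extend(list(wide) for _ in range(factor))
    return out


def stretch_contrast(img: Image) -> Image:
    """Linearly rescale pixel values so they span the full 0-255 range."""
    lo = min(min(row) for row in img)
    hi = max(max(row) for row in img)
    if hi == lo:  # flat image: nothing to stretch
        return [list(row) for row in img]
    return [[(px - lo) * 255 // (hi - lo) for px in row] for row in img]


# Toy 4x4 "image" with a small, low-contrast detail in the center.
image = [
    [10, 10, 10, 10],
    [10, 100, 120, 10],
    [10, 110, 130, 10],
    [10, 10, 10, 10],
]

region = crop(image, (1, 1, 3, 3))      # isolate the 2x2 detail
enlarged = zoom(region, 2)              # enlarge it to 4x4
enhanced = stretch_contrast(enlarged)   # spread its values over 0-255
```

In a real deployment the model would presumably emit code like this against an actual imaging library and then reason over the resulting image; that wiring is outside the scope of this sketch.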
Evaluation Highlights
S1-VL-32B demonstrates strong performance across 13 benchmarks covering scientific multimodal reasoning and image operation reasoning. On authoritative benchmarks such as MMMU, MathVision, and VRSBench-MINI, it surpasses its base model, Qwen3-VL-32B, and remains competitive with much larger open-source models and with closed-source flagship models such as Gemini 2.5 Pro and GPT-5. Notably, it ranks first on all five image operation reasoning benchmarks, outperforming models of comparable and larger scale as well as dedicated "Thinking with Images" models.
Good for
- Academic figure Q&A
- Medical image analysis
- Chemical structure recognition
- Interpreting dense scientific charts and high-resolution imagery
- Tasks requiring image manipulation (cropping, zooming, enhancement) as part of the reasoning process