sensenova/SenseNova-MARS-32B
SenseNova-MARS-32B by sensenova is a 33.4-billion-parameter Multimodal Agentic Reasoning and Search (MARS) framework for Vision-Language Models (VLMs). It gives VLMs interleaved visual reasoning and tool-use capabilities by integrating image search, text search, and image crop tools. Optimized via reinforcement learning with the BN-GSPO algorithm, the model targets knowledge-intensive and visually complex tasks and reports state-of-the-art performance on search-oriented and high-resolution image understanding benchmarks.
SenseNova-MARS-32B Overview
SenseNova-MARS-32B is a 33.4-billion-parameter Multimodal Agentic Reasoning and Search (MARS) framework developed by sensenova. It extends Vision-Language Models (VLMs) by letting them call external tools in the middle of a continuous reasoning process, rather than relying on text-only chain-of-thought or on isolated tool invocations. The model incorporates three tools (image search, text search, and image crop) to address fine-grained and knowledge-intensive visual understanding challenges.
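As a rough illustration of what interleaving reasoning with tool calls can look like, here is a minimal sketch of an agent loop over the three tool types. It is not the SenseNova-MARS implementation: `ToolCall`, `run_agent`, `scripted_policy`, and the tool stubs are hypothetical names, the "image" is a toy nested list of pixel values, the search tools return placeholder strings, and a scripted function stands in for the VLM.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple, Union

# Hypothetical types and names for illustration only; none of them come
# from the SenseNova-MARS codebase.
Image = List[List[int]]  # toy grayscale image: rows of pixel values


@dataclass
class ToolCall:
    name: str
    args: dict


def image_crop(image: Image, left: int, top: int, right: int, bottom: int) -> Image:
    """Return the sub-image covering rows [top, bottom) and columns [left, right)."""
    return [row[left:right] for row in image[top:bottom]]


def text_search(query: str) -> str:
    """Stub: a real system would query a text search backend here."""
    return f"[stub text results for: {query}]"


def image_search(image: Image) -> str:
    """Stub: a real system would run a reverse image search here."""
    return f"[stub image results for a {len(image)}x{len(image[0])} image]"


Step = Union[ToolCall, str]  # a tool request, or a final answer string


def run_agent(policy: Callable[[list], Step], image: Image, question: str,
              max_turns: int = 4) -> List[Tuple[str, object]]:
    """Alternate between policy steps and tool observations until an answer."""
    tools = {
        "image_crop": lambda **kw: image_crop(image, **kw),
        "image_search": lambda **kw: image_search(image),
        "text_search": lambda **kw: text_search(**kw),
    }
    history: List[Tuple[str, object]] = [("user", question)]
    for _ in range(max_turns):
        step = policy(history)
        if isinstance(step, str):  # the policy decided to answer
            history.append(("answer", step))
            return history
        observation = tools[step.name](**step.args)
        history.append(("tool:" + step.name, observation))
    return history  # turn budget exhausted without a final answer


def scripted_policy(history: list) -> Step:
    """Stand-in for the VLM: crop first, then search, then answer."""
    script = [
        ToolCall("image_crop", {"left": 1, "top": 0, "right": 3, "bottom": 2}),
        ToolCall("text_search", {"query": "example landmark"}),
    ]
    used = sum(1 for role, _ in history if role.startswith("tool:"))
    return script[used] if used < len(script) else "final answer (stub)"


image = [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11]]
history = run_agent(scripted_policy, image, "What is shown here?")
for role, content in history:
    print(role, "->", content)
```

The point of the sketch is the control flow: each tool observation is appended to the history before the next policy step, so later reasoning can depend on earlier crops and search results.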
Key Capabilities
- Interleaved Visual Reasoning and Tool-Use: Combines visual analysis with tool calls within a single reasoning process.
- Reinforcement Learning Optimization: Uses the Batch-Normalized Group Sequence Policy Optimization (BN-GSPO) algorithm for stable training and effective tool invocation.
- Advanced Tool Integration: Dynamically uses image search, text search, and image cropping to solve complex visual tasks.
- High Performance: Reports state-of-the-art results on the search-oriented benchmarks MMSearch (74.3) and HR-MMSearch (54.4), surpassing proprietary models such as Gemini-3-Pro and GPT-5.2 in agentic settings. On high-resolution benchmarks it scores 94.2 on V* Bench and 90.2 on HR-Bench 4K.
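This page does not spell out the BN-GSPO formulation, so the following is only a sketch of one plausible reading of the name, not the authors' algorithm. It assumes a GSPO-style length-normalized sequence importance ratio with a clipped objective, and it assumes that "batch-normalized" means rewards are centered within each prompt's group and then scaled by a standard deviation computed over the whole batch. All function names and the `clip` default are hypothetical.

```python
import math
from statistics import mean, pstdev
from typing import List


def sequence_ratio(new_logprobs: List[float], old_logprobs: List[float]) -> float:
    """Length-normalized sequence-level importance ratio (GSPO-style).

    Equals exp(mean over tokens of (new log-prob minus old log-prob)).
    """
    diffs = [n - o for n, o in zip(new_logprobs, old_logprobs)]
    return math.exp(sum(diffs) / len(diffs))


def batch_normalized_advantages(groups: List[List[float]],
                                eps: float = 1e-6) -> List[List[float]]:
    """ASSUMED reading of 'batch-normalized': center per group, scale per batch.

    groups holds one inner list of rollout rewards per prompt.
    """
    centered = [[r - mean(g) for r in g] for g in groups]
    flat = [a for g in centered for a in g]
    scale = pstdev(flat) + eps  # one shared scale for the whole batch
    return [[a / scale for a in g] for g in centered]


def clipped_objective(ratio: float, advantage: float, clip: float = 0.2) -> float:
    """PPO-style clipped surrogate applied to a sequence-level ratio."""
    clipped_ratio = max(1.0 - clip, min(ratio, 1.0 + clip))
    return min(ratio * advantage, clipped_ratio * advantage)


# Two prompts, two rollouts each. The second group has identical rewards,
# so its centered advantages are zero and carry no learning signal.
advantages = batch_normalized_advantages([[1, 0], [1, 1]])
ratio = sequence_ratio([-1.0, -2.0], [-1.0, -2.0])
print(advantages, ratio, clipped_objective(1.5, 1.0))
```

Under this assumed reading, a shared batch-level scale keeps a group whose rewards barely differ from being inflated by its own small standard deviation, which is one way such a scheme could contribute to the training stability described above.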
Good For
- Applications that require multimodal agentic reasoning.
- Knowledge-intensive and visually complex tasks.
- Use cases that benefit from dynamic integration of search and image crop tools.
- Benchmarking against state-of-the-art VLM performance in search and high-resolution image understanding.