sensenova/SenseNova-MARS-8B

Vision · Concurrency Cost: 1 · Model Size: 8B · Quant: FP8 · Ctx Length: 32k · Published: Dec 26, 2025 · License: MIT · Architecture: Transformer · Open Weights

SenseNova-MARS-8B is an 8-billion-parameter multimodal agentic reasoning and search framework developed by SenseNova. This VLM integrates image search, text search, and image cropping tools, and targets complex visual understanding tasks that require coordinated external tool use. It performs strongly on search-oriented benchmarks such as MMSearch and HR-MMSearch, where the MARS family is reported to be competitive with proprietary models in agentic settings.


SenseNova-MARS-8B: Multimodal Agentic Reasoning and Search

SenseNova-MARS-8B is an 8-billion-parameter Vision-Language Model (VLM) trained with reinforcement learning to strengthen agentic reasoning and tool use. Unlike traditional VLMs, which rely primarily on text-oriented chain-of-thought, SenseNova-MARS interleaves its reasoning with calls to image search, text search, and image cropping tools to handle knowledge-intensive and visually complex scenarios.

Key Capabilities

  • Interleaved Visual Reasoning and Tool-Use: Seamlessly combines visual understanding with dynamic tool manipulation.
  • Integrated Tool Suite: Leverages text search, image search, and image crop tools for fine-grained visual analysis.
  • Reinforcement Learning Optimization: Trained with the Batch-Normalized Group Sequence Policy Optimization (BN-GSPO) algorithm for stable training and effective tool invocation.
  • High-Resolution Image Handling: Evaluated on HR-MMSearch, a benchmark built around high-resolution images and knowledge-intensive, search-driven questions.
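
The interleaved reasoning and tool use described above can be pictured as a simple loop: the model either requests a tool or gives a final answer, and each tool result is fed back as an observation. The sketch below is only an illustration of that pattern, not the SenseNova-MARS implementation. Every name in it (`ToolCall`, `Step`, `run_agent`, `scripted_policy`, `TOOLS`, and the tool argument formats) is hypothetical, and the tools are stubs that return placeholder strings instead of calling a real search or cropping backend.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple

History = List[Tuple[str, str]]  # (role, text) pairs


@dataclass
class ToolCall:
    name: str      # hypothetical tool name, e.g. "image_crop"
    argument: str  # hypothetical free-form argument string


@dataclass
class Step:
    # A policy step is either a tool request or a final answer.
    tool_call: Optional[ToolCall] = None
    answer: Optional[str] = None


def run_agent(
    policy: Callable[[History], Step],
    tools: Dict[str, Callable[[str], str]],
    question: str,
    max_turns: int = 4,
) -> Tuple[Optional[str], History]:
    """Alternate between policy steps and tool calls until an answer appears."""
    history: History = [("user", question)]
    for _ in range(max_turns):
        step = policy(history)
        if step.answer is not None:
            return step.answer, history
        call = step.tool_call
        tool = tools.get(call.name)
        # Unknown tools produce an error observation rather than raising.
        observation = tool(call.argument) if tool else f"unknown tool: {call.name}"
        history.append(("tool_call", f"{call.name}({call.argument})"))
        history.append(("observation", observation))
    return None, history  # turn budget exhausted without an answer


def scripted_policy(history: History) -> Step:
    """Stand-in for the model: crop first, then search, then answer."""
    observations = [text for role, text in history if role == "observation"]
    if len(observations) == 0:
        return Step(tool_call=ToolCall("image_crop", "0.1,0.1,0.5,0.5"))
    if len(observations) == 1:
        return Step(tool_call=ToolCall("text_search", "logo on the sign"))
    return Step(answer="answer based on: " + " | ".join(observations))


# Stub tools: placeholders for real image crop / search backends.
TOOLS: Dict[str, Callable[[str], str]] = {
    "image_crop": lambda arg: f"[cropped region {arg}]",
    "text_search": lambda arg: f"[text results for '{arg}']",
    "image_search": lambda arg: f"[image results for '{arg}']",
}

answer, trace = run_agent(scripted_policy, TOOLS, "What brand is on the sign?")
print(answer)
```

In a real deployment the `policy` callable would wrap the model's generation step and parse its tool requests, and `max_turns` would bound how many tool calls a single question may consume.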

Performance Highlights

SenseNova-MARS-8B achieves competitive performance on search-oriented benchmarks. In agentic model evaluations, it scores 67.84 on MMSearch and 41.64 on HR-MMSearch, two benchmarks of complex visual tasks that require external tools. The larger SenseNova-MARS-32B variant is reported to surpass proprietary models such as Gemini-3-Pro and GPT-5.2 on these benchmarks.

Good for

  • Applications requiring advanced multimodal reasoning with dynamic tool integration.
  • Tasks involving knowledge-intensive visual understanding and search.
  • Scenarios demanding coordinated use of image search, text search, and image cropping.