SpatialReward/SpatialReward-8B

Vision | Concurrency Cost: 1 | Model Size: 8B | Quant: FP8 | Ctx Length: 32k | Published: May 5, 2026 | License: apache-2.0 | Architecture: Transformer | Open Weights | Cold

SpatialReward-8B is an 8-billion-parameter reward model developed by SpatialReward for instruction-guided image editing. It addresses the "Attention Collapse" problem by incorporating explicit spatial reasoning: it predicts bounding boxes for the edited regions and anchors its semantic judgments to them. The model is reported to achieve state-of-the-art performance both as an evaluator and as an RL training signal for image editing tasks, outperforming models such as GPT-4.1 and GPT-5 on the MER-Bench benchmark.


What is SpatialReward-8B?

SpatialReward-8B is an 8-billion-parameter reward model built specifically for instruction-guided image editing. Developed by SpatialReward, it targets the "Attention Collapse" problem, in which traditional reward models often fail to accurately evaluate fine-grained editing details and cross-image comparisons.

Key Capabilities & Innovations

  • Explicit Spatial Reasoning: SpatialReward-8B predicts bounding boxes for edited regions and grounds its semantic judgments in that pixel-level evidence, which makes its evaluations of image edits more accurate and reliable.
  • MER-Bench Benchmark: MER-Bench, a new benchmark released with the model, features multi-edit scenarios and expert human annotations. On it, SpatialReward-8B is reported to outperform models such as GPT-4.1 and GPT-5.
  • Stable RL Training Signal: The model serves as an effective and stable reward signal for online reinforcement learning (RL) in image editing.
  • Open-Sourced Resources: Alongside the model weights, SpatialReward has open-sourced the MER-Bench evaluation benchmark and the SpatialReward-Train dataset (260k spatially aware training examples).
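
To make the bounding-box grounding in the first bullet more concrete, the following is a minimal sketch of how a predicted edit region could be compared against an annotated one using intersection-over-union (IoU), a standard localization metric. It is an illustration only, not the model's actual output format or the MER-Bench scoring procedure: the `(x_min, y_min, x_max, y_max)` pixel box convention and the function names `box_area` and `iou` are assumptions made for this example.

```python
# Illustrative sketch (assumed box format, hypothetical helper names).
# A box is (x_min, y_min, x_max, y_max) in pixel coordinates.
Box = tuple[float, float, float, float]


def box_area(box: Box) -> float:
    """Area of a box; degenerate or inverted boxes count as zero."""
    x0, y0, x1, y1 = box
    return max(0.0, x1 - x0) * max(0.0, y1 - y0)


def iou(predicted: Box, reference: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes, in [0, 1]."""
    # Corners of the overlapping rectangle (may be inverted if disjoint).
    overlap = (
        max(predicted[0], reference[0]),
        max(predicted[1], reference[1]),
        min(predicted[2], reference[2]),
        min(predicted[3], reference[3]),
    )
    intersection = box_area(overlap)
    union = box_area(predicted) + box_area(reference) - intersection
    return intersection / union if union > 0 else 0.0


if __name__ == "__main__":
    predicted_region = (0.0, 0.0, 10.0, 10.0)   # hypothetical predicted edit region
    annotated_region = (5.0, 5.0, 15.0, 15.0)   # hypothetical human annotation
    print(f"IoU = {iou(predicted_region, annotated_region):.3f}")
```

A high IoU would indicate that the predicted region overlaps closely with the area a human marked as edited; a judgment tied to a poorly localized box would be less trustworthy under this check.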

Why is it Different?

What sets SpatialReward-8B apart from many existing reward models is its explicit spatial grounding. By localizing the specific areas of an image that have been edited, it provides more precise and reliable feedback and avoids the "Attention Collapse" that hinders other models in complex image editing tasks. This makes it particularly effective for scenarios requiring detailed, localized changes based on natural language instructions.