internlm/Visual-ERM

Vision · Concurrency cost: 1 · Model size: 8B · Quantization: FP8 · Context length: 32k · Published: Mar 12, 2026 · License: apache-2.0 · Architecture: Transformer

Visual-ERM by InternLM is an 8-billion parameter multimodal generative reward model, fine-tuned from Qwen3-VL-8B-Instruct, designed for vision-to-code tasks. It uniquely evaluates outputs by comparing ground-truth and rendered images in the visual space, generating fine-grained, interpretable, and task-agnostic discrepancy feedback. This model excels at identifying visual differences in structured visual reconstruction tasks like Chart-to-Code and Table-to-Markdown, serving as both a reward model for reinforcement learning and a visual critic for test-time refinement.


Visual-ERM: Multimodal Generative Reward Model

Visual-ERM, developed by InternLM, is an 8-billion parameter multimodal generative reward model specifically designed for vision-to-code tasks. Unlike traditional text-based or coarse vision embedding rewards, Visual-ERM directly compares a ground-truth image with a rendered image from a model's prediction. It then generates structured discrepancy annotations that are fine-grained, interpretable, and task-agnostic, providing detailed feedback on visual differences.

Key Capabilities

  • Visual-space reward modeling: Evaluates predictions by comparing rendered visual outputs, capturing layout, spacing, alignment, and style.
  • Fine-grained and interpretable feedback: Produces structured annotations (category, severity, location, description) instead of a single score.
  • Task-agnostic supervision: A unified reward model applicable across various structured visual reconstruction tasks.
  • Dual utility: Functions as a reward model for Reinforcement Learning (RL) and as a visual critic for test-time reflection and revision.
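For RL training, the structured annotations described above must be collapsed into a scalar reward. A minimal sketch of one way to do this, assuming a hypothetical JSON-like annotation schema (the field names follow the card's description; the exact output format and any severity weighting are assumptions, not the model's documented behavior):

```python
# Hypothetical severity weights; the real reward shaping used with Visual-ERM
# may differ.
SEVERITY_PENALTY = {"minor": 0.05, "moderate": 0.15, "major": 0.40}

def annotations_to_reward(annotations):
    """Map a list of discrepancy annotations to a scalar reward in [0, 1].

    An empty list (no discrepancies found) yields the maximum reward of 1.0;
    each annotation subtracts a penalty based on its severity.
    """
    penalty = sum(SEVERITY_PENALTY.get(a["severity"], 0.15) for a in annotations)
    return max(0.0, 1.0 - penalty)

# Example feedback in the assumed schema (category, severity, location, description):
feedback = [
    {"category": "layout", "severity": "major",
     "location": "legend", "description": "legend missing from rendered chart"},
    {"category": "style", "severity": "minor",
     "location": "x-axis", "description": "tick labels rotated differently"},
]
reward = annotations_to_reward(feedback)  # ≈ 0.55
```

Because the annotations are structured rather than a single opaque score, the weighting is easy to adapt per task, e.g. penalizing layout errors more heavily for Chart-to-Code than for Table-to-Markdown.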

Good for

  • Structured visual reconstruction tasks: Including Chart-to-Code, Table-to-Markdown, and SVG-to-Code.
  • RL training: Providing robust reward signals for multimodal models.
  • Inference-time refinement: Enabling models to self-correct based on visual discrepancy feedback.
  • Research: Advancing visual reward modeling and multimodal RL.
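The inference-time refinement use case above amounts to a generate–render–critique loop. A sketch of that control flow follows; everything in it is illustrative: `generate_code`, `render`, and `critique` are placeholders standing in for the policy model, a real renderer (e.g. executing chart code), and a Visual-ERM call, respectively.

```python
def refine(ground_truth_image, generate_code, render, critique, max_rounds=3):
    """Generate code, render it, ask the visual critic for discrepancies,
    and feed that feedback back into generation until the critic is satisfied
    or the round budget is exhausted. All three callables are stand-ins for
    real model/renderer calls."""
    feedback = []
    code = None
    for _ in range(max_rounds):
        code = generate_code(feedback)        # policy model, conditioned on prior critique
        rendered = render(code)               # produce an image from the predicted code
        feedback = critique(ground_truth_image, rendered)  # discrepancy annotations
        if not feedback:                      # no discrepancies: accept this output
            break
    return code, feedback

# Toy demonstration with stubs: the "critic" flags a missing legend until the
# "generator" incorporates that feedback into its next attempt.
def generate_code(feedback):
    return "plot(data, legend=True)" if feedback else "plot(data)"

def render(code):
    return code  # stand-in: treat the code string itself as the rendered image

def critique(gt, rendered):
    if "legend=True" in rendered:
        return []
    return [{"category": "layout", "severity": "major",
             "location": "legend", "description": "legend missing"}]

code, remaining = refine("gt.png", generate_code, render, critique)
# code == "plot(data, legend=True)", remaining == []
```

The interpretable annotations are what make this loop work: unlike a bare score, they tell the generator *what* to fix on the next round.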

Visual-ERM is fine-tuned from Qwen/Qwen3-VL-8B-Instruct and is accompanied by VC-RewardBench, a benchmark of 1,335 curated instances for evaluating fine-grained image-to-image discrepancy judgment on structured visual data across charts, tables, and SVGs.