internlm/Visual-ERM

Vision · Concurrency cost: 1 · Model size: 8B · Quantization: FP8 · Context length: 32k · Published: Mar 12, 2026 · License: apache-2.0 · Architecture: Transformer

Visual-ERM by InternLM is an 8-billion parameter multimodal generative reward model, fine-tuned from Qwen3-VL-8B-Instruct, designed for vision-to-code tasks. It uniquely evaluates outputs by comparing ground-truth and rendered images in the visual space, generating fine-grained, interpretable, and task-agnostic discrepancy feedback. This model excels at identifying visual differences in structured visual reconstruction tasks like Chart-to-Code and Table-to-Markdown, serving as both a reward model for reinforcement learning and a visual critic for test-time refinement.


Visual-ERM: Multimodal Generative Reward Model

Visual-ERM, developed by InternLM, is an 8-billion parameter multimodal generative reward model specifically designed for vision-to-code tasks. Unlike traditional text-based or coarse vision embedding rewards, Visual-ERM directly compares a ground-truth image with a rendered image from a model's prediction. It then generates structured discrepancy annotations that are fine-grained, interpretable, and task-agnostic, providing detailed feedback on visual differences.

Key Capabilities

  • Visual-space reward modeling: Evaluates predictions by comparing rendered visual outputs, capturing layout, spacing, alignment, and style.
  • Fine-grained and interpretable feedback: Produces structured annotations (category, severity, location, description) instead of a single score.
  • Task-agnostic supervision: A unified reward model applicable across various structured visual reconstruction tasks.
  • Dual utility: Functions as a reward model for Reinforcement Learning (RL) and as a visual critic for test-time reflection and revision.
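For RL training, the structured annotations described above must be collapsed into a scalar reward. A minimal sketch of one way to do this, assuming a hypothetical JSON-like annotation schema (the field names follow the card's description; the exact output format and any severity weighting are assumptions, not the model's documented behavior):

```python
# Hypothetical severity weights; the real reward shaping used with Visual-ERM
# may differ.
SEVERITY_PENALTY = {"minor": 0.05, "moderate": 0.15, "major": 0.40}

def annotations_to_reward(annotations):
    """Map a list of discrepancy annotations to a scalar reward in [0, 1].

    An empty list (no discrepancies found) yields the maximum reward of 1.0;
    each annotation subtracts a penalty based on its severity.
    """
    penalty = sum(SEVERITY_PENALTY.get(a["severity"], 0.15) for a in annotations)
    return max(0.0, 1.0 - penalty)

# Example feedback in the assumed schema (category, severity, location, description):
feedback = [
    {"category": "layout", "severity": "major",
     "location": "legend", "description": "legend missing from rendered chart"},
    {"category": "style", "severity": "minor",
     "location": "x-axis", "description": "tick labels rotated differently"},
]
reward = annotations_to_reward(feedback)  # ≈ 0.55
```

Because the annotations are structured rather than a single opaque score, the weighting is easy to adapt per task, e.g. penalizing layout errors more heavily for Chart-to-Code than for Table-to-Markdown.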

Good for

  • Structured visual reconstruction tasks: Including Chart-to-Code, Table-to-Markdown, and SVG-to-Code.
  • RL training: Providing robust reward signals for multimodal models.
  • Inference-time refinement: Enabling models to self-correct based on visual discrepancy feedback.
  • Research: Advancing visual reward modeling and multimodal RL.
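The inference-time refinement use case above amounts to a generate–render–critique loop. A sketch of that control flow follows; everything in it is illustrative: `generate_code`, `render`, and `critique` are placeholders standing in for the policy model, a real renderer (e.g. executing chart code), and a Visual-ERM call, respectively.

```python
def refine(ground_truth_image, generate_code, render, critique, max_rounds=3):
    """Generate code, render it, ask the visual critic for discrepancies,
    and feed that feedback back into generation until the critic is satisfied
    or the round budget is exhausted. All three callables are stand-ins for
    real model/renderer calls."""
    feedback = []
    code = None
    for _ in range(max_rounds):
        code = generate_code(feedback)        # policy model, conditioned on prior critique
        rendered = render(code)               # produce an image from the predicted code
        feedback = critique(ground_truth_image, rendered)  # discrepancy annotations
        if not feedback:                      # no discrepancies: accept this output
            break
    return code, feedback

# Toy demonstration with stubs: the "critic" flags a missing legend until the
# "generator" incorporates that feedback into its next attempt.
def generate_code(feedback):
    return "plot(data, legend=True)" if feedback else "plot(data)"

def render(code):
    return code  # stand-in: treat the code string itself as the rendered image

def critique(gt, rendered):
    if "legend=True" in rendered:
        return []
    return [{"category": "layout", "severity": "major",
             "location": "legend", "description": "legend missing"}]

code, remaining = refine("gt.png", generate_code, render, critique)
# code == "plot(data, legend=True)", remaining == []
```

The interpretable annotations are what make this loop work: unlike a bare score, they tell the generator *what* to fix on the next round.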

Visual-ERM is fine-tuned from Qwen/Qwen3-VL-8B-Instruct and is accompanied by VC-RewardBench, a benchmark of 1,335 curated instances for evaluating fine-grained image-to-image discrepancy judgment on structured visual data across charts, tables, and SVGs.