Visual-ERM: Multimodal Generative Reward Model
Visual-ERM, developed by InternLM, is an 8-billion-parameter multimodal generative reward model designed specifically for vision-to-code tasks. Unlike traditional text-based rewards or coarse vision-embedding similarity scores, Visual-ERM directly compares a ground-truth image with an image rendered from the model's predicted code. It then generates structured discrepancy annotations that are fine-grained, interpretable, and task-agnostic, providing detailed feedback on visual differences.
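To make the idea of structured discrepancy annotations concrete, here is a minimal sketch of what parsing such output could look like. The field names (`category`, `severity`, `location`, `description`) follow the description in this card, but the exact JSON schema is an assumption, not Visual-ERM's documented output format.

```python
import json
from dataclasses import dataclass


# Hypothetical annotation record mirroring the fields described above;
# the schema is illustrative, not the model's official contract.
@dataclass
class Discrepancy:
    category: str     # e.g. "layout", "color", "text"
    severity: str     # e.g. "minor", "major"
    location: str     # e.g. "legend", "x-axis"
    description: str  # free-text explanation of the visual difference


def parse_annotations(raw: str) -> list[Discrepancy]:
    """Parse a JSON list of discrepancy objects emitted by the critic."""
    return [Discrepancy(**item) for item in json.loads(raw)]


# Example critic output (fabricated for illustration only)
raw = '''[
  {"category": "layout", "severity": "major",
   "location": "legend", "description": "Legend missing from rendered chart."},
  {"category": "color", "severity": "minor",
   "location": "series 1", "description": "Bar color is blue, not orange."}
]'''
annotations = parse_annotations(raw)
```

Structured records like these are what make the feedback machine-readable for both RL reward computation and revision prompting.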
Key Capabilities
- Visual-space reward modeling: Evaluates predictions by comparing rendered visual outputs, capturing layout, spacing, alignment, and style.
- Fine-grained and interpretable feedback: Produces structured annotations (category, severity, location, description) instead of a single score.
- Task-agnostic supervision: A unified reward model applicable across various structured visual reconstruction tasks.
- Dual utility: Functions as a reward model for Reinforcement Learning (RL) and as a visual critic for test-time reflection and revision.
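For the RL use case above, the structured annotations must be collapsed into a scalar reward. The sketch below shows one plausible reduction; the severity weights and the exponential decay are illustrative choices, not Visual-ERM's documented scoring rule.

```python
import math

# Assumed per-severity penalties; real weights would be tuned for the task.
SEVERITY_PENALTY = {"minor": 0.5, "major": 2.0}


def annotations_to_reward(annotations: list[dict]) -> float:
    """Collapse {category, severity, ...} annotations into a reward in (0, 1].

    An empty annotation list (no visual discrepancies) yields reward 1.0;
    each discrepancy multiplies the reward down by exp(-penalty).
    """
    total = sum(SEVERITY_PENALTY.get(a["severity"], 1.0) for a in annotations)
    return math.exp(-total)


perfect = annotations_to_reward([])
flawed = annotations_to_reward([{"severity": "major"},
                                {"severity": "minor"}])
```

A multiplicative (exponential) reduction keeps the reward bounded and smoothly differentiates "one minor issue" from "several major ones", which is convenient for policy-gradient training.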
Good for
- Structured visual reconstruction tasks: Including Chart-to-Code, Table-to-Markdown, and SVG-to-Code.
- RL training: Providing robust reward signals for multimodal models.
- Inference-time refinement: Enabling models to self-correct based on visual discrepancy feedback.
- Research: Advancing visual reward modeling and multimodal RL.
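The inference-time refinement use case can be sketched as a generate-render-critique loop. Here `generate`, `render`, and `critique` are stand-in callables for the code model, the rendering toolchain, and Visual-ERM respectively; the loop structure is an assumption about how a visual critic would typically be wired in, not a documented API.

```python
from typing import Callable


def refine(prompt: str,
           generate: Callable[[str], str],
           render: Callable[[str], bytes],
           critique: Callable[[bytes], list[str]],
           max_rounds: int = 3) -> str:
    """Iteratively revise generated code until the visual critic is satisfied."""
    code = generate(prompt)
    for _ in range(max_rounds):
        feedback = critique(render(code))
        if not feedback:  # critic found no discrepancies -> done
            break
        # Fold the critic's discrepancy feedback into a revision prompt
        code = generate(prompt + "\nFix these issues:\n" + "\n".join(feedback))
    return code


# Toy stand-ins: the "critic" complains until the code contains a legend.
def generate(p: str) -> str:
    return "plot()" if "Fix" not in p else "plot(legend=True)"

def render(c: str) -> bytes:
    return c.encode()

def critique(img: bytes) -> list[str]:
    return [] if b"legend" in img else ["missing legend"]

final = refine("chart-to-code task", generate, render, critique)
```

Capping the loop at `max_rounds` bounds inference cost while still letting most fixable discrepancies be resolved in one or two revision passes.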
Visual-ERM is fine-tuned from Qwen/Qwen3-VL-8B-Instruct and is accompanied by VC-RewardBench, a benchmark of 1,335 curated instances for evaluating fine-grained image-to-image discrepancy judgment on structured visual data spanning charts, tables, and SVGs.