WebArbiter-7B: A Principle-Guided Reasoning Process Reward Model
WebArbiter-7B is a 7.6 billion parameter Process Reward Model (PRM) for web agents, developed by ZYao720 and based on Qwen2.5-7B-Instruct. Unlike traditional scalar or checklist-based reward models, WebArbiter-7B generates structured text outputs, including `<State>`, `<Criteria>`, `<Analysis>`, and `<Answer>`, to provide auditable reasoning chains for its preference verdicts. This approach allows evaluation principles to be derived dynamically from user intent and page state, improving robustness and generalization across diverse web environments.
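A consumer of the model's output needs to recover these sections before acting on the verdict. Below is a minimal parsing sketch; the section names come from this card, but the exact XML-like serialization and the sample text are illustrative assumptions, not the model's documented wire format.

```python
import re

# Section names from the model card; tag-style serialization is assumed.
SECTIONS = ("State", "Criteria", "Analysis", "Answer")

def parse_verdict(text: str) -> dict:
    """Extract each structured section from a WebArbiter-style output.

    Returns a dict mapping section name to its stripped contents,
    or None for any section that is missing from the text.
    """
    parsed = {}
    for name in SECTIONS:
        match = re.search(rf"<{name}>(.*?)</{name}>", text, re.DOTALL)
        parsed[name] = match.group(1).strip() if match else None
    return parsed

# Hypothetical model output, for illustration only.
sample = (
    "<State>Search results page for 'red shoes'.</State>\n"
    "<Criteria>Prefer actions that move toward checkout.</Criteria>\n"
    "<Analysis>Action A opens the matching product; B navigates away.</Analysis>\n"
    "<Answer>A</Answer>"
)
verdict = parse_verdict(sample)
print(verdict["Answer"])  # → A
```

Keeping the sections in a dict makes it easy to log the `Analysis` chain alongside the final `Answer` when debugging agent runs.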
Key Capabilities
- Structured Reasoning: Provides interpretable, step-level evaluations with explicit reasoning, making its decisions transparent and debuggable.
- Superior Performance: Achieves an Avg. BoN Acc of 74.60% on the WEBPRMBENCH benchmark, surpassing GPT-5 by 9.1 points and the prior SOTA WebShepherd-8B by 31 points.
- Robust Generalization: Demonstrates state-of-the-art performance across various WebPRMBench environments, including out-of-domain enterprise workflows (WorkArena) and open-world websites (AssistantBench).
- Reward-Guided Trajectory Search: Significantly improves success rates in reward-guided trajectory search on WebArena-Lite, outperforming WebShepherd-8B by up to 6.4 points.
- Two-Stage Training: Utilizes reasoning distillation from a teacher model, followed by Reinforcement Learning with Verifiable Rewards (via GRPO) to refine judgments and align them with ground-truth correctness.
Good For
- Evaluating Web Agent Actions: Determining which of two candidate actions better advances a user's task in a given web state.
- Guiding Web Agent Trajectory Search: Serving as a robust reward signal for Best-of-N sampling or tree search mechanisms in web automation.
- Interpretable Feedback: Generating detailed, structured justifications for action preferences, aiding in debugging and analysis of web agent behavior.
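Since the model issues pairwise preference verdicts, Best-of-N selection can be run as a sequential knockout over candidate actions. The sketch below assumes the caller has already mapped the model's `<Answer>` field to an `"A"`/`"B"` string; the `judge` callable and the toy judge used in the demo are illustrative stand-ins, not part of the model's API.

```python
from typing import Callable, Sequence

def best_of_n(candidates: Sequence[str],
              judge: Callable[[str, str], str]) -> str:
    """Sequential-knockout Best-of-N using pairwise preference verdicts.

    judge(a, b) returns "A" if the first candidate is preferred and "B"
    otherwise -- a stand-in for querying WebArbiter-7B on a candidate
    pair and reading out its <Answer> section.
    """
    best = candidates[0]
    for challenger in candidates[1:]:
        # Keep the incumbent unless the judge prefers the challenger.
        if judge(best, challenger) == "B":
            best = challenger
    return best

# Toy judge that prefers the shorter action string (illustration only).
toy_judge = lambda a, b: "A" if len(a) <= len(b) else "B"
actions = ["click(#buy-now)", "scroll(down)", "type(#search, 'shoes')"]
print(best_of_n(actions, toy_judge))  # → scroll(down)
```

A knockout needs only N-1 judge calls per step, which keeps the PRM overhead linear when it is used inside a trajectory search loop.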