WebArbiter-3B by ZYao720 is a 3.1-billion-parameter reasoning Process Reward Model (PRM) for web agents, built on Qwen2.5-3B-Instruct, with a 32,768-token context length. It generates interpretable, principle-inducing justifications for web agent actions and concludes each with a preference verdict. The model achieves an average Best-of-N accuracy (Avg. BoN Acc) of 59.06% on WEBPRMBENCH, outperforming larger open-source LLM-as-judge baselines and previous SOTA WebPRMs. It is designed for evaluating web agent actions, guiding trajectory search, and providing structured feedback in resource-constrained environments.
WebArbiter-3B: Principle-Guided Reasoning for Web Agents
WebArbiter-3B, developed by ZYao720, is a 3.1 billion parameter Process Reward Model (PRM) specifically designed for web agents. Built upon the Qwen2.5-3B-Instruct architecture, it distinguishes itself by formulating step-level reward modeling as structured text generation, providing interpretable justifications rather than simple scalar scores.
Key Capabilities & Features
- Reasoning as Reward: Generates structured outputs including <State>, <Criteria>, <Analysis>, and <Answer>, offering auditable reasoning chains for action preferences.
- Principle-Inducing Evaluation: Dynamically derives evaluation principles from user intent and page state, enhancing robustness across diverse web environments.
- Two-Stage Training: Uses reasoning distillation from an o3 teacher, followed by Reinforcement Learning with Verifiable Rewards (RLVR) using GRPO, to refine verdicts and align them with ground-truth correctness.
- Strong Performance: Achieves an Avg. BoN Acc of 59.06% on the WEBPRMBENCH benchmark, surpassing the previous 3B SOTA WebPRM by 15.5 points and outperforming open-source LLM-as-judge baselines up to 70B parameters.
- Efficiency: Despite its compact size, it demonstrates performance superior to larger models like WebShepherd-8B, making it suitable for resource-constrained deployment.
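Because the reward is emitted as structured text rather than a scalar, downstream code has to extract the verdict from the generated output. The following is a minimal sketch of such a parser. It assumes each section is wrapped in a matching open/close tag pair (for example `<Answer>...</Answer>`) and that the answer names the preferred action. The exact output format, the function name `parse_verdict`, and the sample text are illustrative assumptions, not taken from the model's documentation.

```python
import re
from typing import Dict, Optional

# Section names listed in the model card; the closing-tag convention is an assumption.
TAGS = ("State", "Criteria", "Analysis", "Answer")


def parse_verdict(text: str) -> Dict[str, Optional[str]]:
    """Split a generated justification into its tagged sections.

    Returns a dict keyed by tag name; a section that is missing or
    not properly closed maps to None.
    """
    sections: Dict[str, Optional[str]] = {}
    for tag in TAGS:
        match = re.search(rf"<{tag}>(.*?)</{tag}>", text, flags=re.DOTALL)
        sections[tag] = match.group(1).strip() if match else None
    return sections


# Hypothetical model output, written by hand for illustration only.
sample_output = """
<State>The cart page lists one item; the checkout button is visible.</State>
<Criteria>Prefer the action that moves toward completing the purchase.</Criteria>
<Analysis>Action 1 reopens the search page. Action 2 clicks checkout.</Analysis>
<Answer>Action 2</Answer>
"""

parsed = parse_verdict(sample_output)
print(parsed["Answer"])
```

Returning `None` for a missing section lets the caller decide how to treat malformed generations, for instance by discarding the comparison or retrying.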
Intended Uses
- Evaluating Web Agent Actions: Determines which of two candidate actions better advances a user's task given a web state.
- Guiding Trajectory Search: Provides the reward signal for Best-of-N sampling or tree search during web agent execution.
- Interpretable Feedback: Offers structured, human-readable justifications for action preferences, aiding in debugging and analysis of web agent behavior.
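The pairwise verdict can drive Best-of-N selection by comparing candidates one at a time and keeping the current winner. The sketch below shows one simple way to do this (a sequential "champion versus challenger" pass). It is not necessarily the procedure used in the WebArbiter evaluation. The `Judge` callable stands in for a call to the model and is hypothetical, as is the toy `keyword_judge` used for the demonstration.

```python
from typing import Callable, Sequence

# Hypothetical judge interface: given (intent, state, action_a, action_b),
# return 0 if action_a is preferred and 1 if action_b is preferred.
Judge = Callable[[str, str, str, str], int]


def best_of_n(intent: str, state: str, candidates: Sequence[str], judge: Judge) -> str:
    """Pick one action from N candidates using N-1 pairwise comparisons."""
    if not candidates:
        raise ValueError("candidates must not be empty")
    best = candidates[0]
    for challenger in candidates[1:]:
        # Replace the current best only when the judge prefers the challenger.
        if judge(intent, state, best, challenger) == 1:
            best = challenger
    return best


def keyword_judge(intent: str, state: str, action_a: str, action_b: str) -> int:
    """Toy stand-in for the reward model: prefers actions mentioning 'cart'."""
    a_hit = "cart" in action_a.lower()
    b_hit = "cart" in action_b.lower()
    return 1 if (b_hit and not a_hit) else 0


chosen = best_of_n(
    intent="Add the blue mug to the cart",
    state="[button] Add to cart  [link] Home  [link] Reviews",
    candidates=['click("Home")', 'click("Add to cart")', 'click("Reviews")'],
    judge=keyword_judge,
)
print(chosen)
```

In practice the judge would prompt the model with both candidate actions and map its final answer to 0 or 1. A sequential pass costs N-1 model calls, and its result can depend on candidate order if the judge's preferences are not transitive.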
Limitations
WebArbiter-3B operates on text-only accessibility tree representations, so it can miss visual cues. It is currently English-only, may be biased toward safe actions, and occasionally hallucinates element references.