WebArbiter-4B-Qwen3: A Principle-Guided Reasoning Process Reward Model
WebArbiter-4B-Qwen3, developed by ZYao720, is a 4-billion-parameter reasoning Process Reward Model (PRM) designed for web agents. Built on the Qwen3-4B architecture, it evaluates web agent actions by generating structured, interpretable justifications.
Key Capabilities
- Parameter-efficient performance: Achieves an average Best-of-N accuracy (Avg. BoN Acc) of 72.55% on WebPRMBench, approaching the performance of 7B-parameter models at roughly half the size.
- Reasoning as reward: Formulates step-level reward modeling as structured text generation, producing auditable reasoning chains with <State>, <Criteria>, <Analysis>, and <Answer> outputs.
- Principle-inducing evaluation: Dynamically derives evaluation principles based on user intent and the current page state.
- Two-stage training: Uses supervised fine-tuning (SFT) on reasoning distilled from an o3 teacher, followed by Reinforcement Learning with Verifiable Rewards (GRPO).
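The tagged output format described above lends itself to simple post-processing. The following is a minimal sketch of splitting a generated judgment into its four sections. It assumes each section is wrapped in matching opening and closing tags; the closing tags, the `parse_judgment` helper, and the sample text are illustrative assumptions rather than details taken from the model's documentation.

```python
import re

# Section names listed in the capabilities above.
TAGS = ("State", "Criteria", "Analysis", "Answer")


def parse_judgment(text: str) -> dict[str, str]:
    """Extract the four tagged sections from a generated judgment.

    Assumes every section appears as <Tag>...</Tag>; raises ValueError
    if a section is missing so malformed generations are not silently used.
    """
    sections = {}
    for tag in TAGS:
        match = re.search(rf"<{tag}>(.*?)</{tag}>", text, flags=re.DOTALL)
        if match is None:
            raise ValueError(f"missing <{tag}> section")
        sections[tag] = match.group(1).strip()
    return sections


# Hypothetical model output, written by hand for illustration only.
sample = (
    "<State>Search results page for 'wireless mouse'.</State>\n"
    "<Criteria>The action should move toward adding the item to the cart.</Criteria>\n"
    "<Analysis>Action 1 opens the product page; Action 2 reloads the search.</Analysis>\n"
    "<Answer>Action 1</Answer>"
)

parsed = parse_judgment(sample)
print(parsed["Answer"])
```

Failing loudly on a missing section is a deliberate choice here: a reward signal parsed from a truncated generation is easy to misuse, so the caller can decide whether to retry or discard it.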
Good For
- Evaluating web agent actions: Determines which of two candidate actions better advances a user's task given a web state.
- Guiding trajectory search: Provides a robust reward signal for techniques like Best-of-N sampling or tree search in web agent execution.
- Interpretable feedback: Generates clear, structured explanations for action preferences, enhancing transparency and debugging.
- Resource-efficient deployment: Offers strong performance at a 4B parameter scale, making it suitable for environments with computational constraints.
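Because the model compares two candidate actions at a time, one way to use it for Best-of-N selection is a sequential knockout: keep a current best action and replace it whenever a challenger is preferred. The sketch below illustrates that idea only; it is not claimed to be the authors' evaluation procedure. The `prefer` callable stands in for a call to the reward model, and `toy_prefer` is a hand-written placeholder so the example runs without any model.

```python
from typing import Callable, Sequence

# Hypothetical comparator interface: returns True if the first action
# is preferred over the second. In practice this would wrap a model call.
Prefer = Callable[[str, str], bool]


def best_of_n(candidates: Sequence[str], prefer: Prefer) -> str:
    """Pick one action from N candidates using N - 1 pairwise comparisons."""
    if not candidates:
        raise ValueError("need at least one candidate")
    best = candidates[0]
    for challenger in candidates[1:]:
        # Keep the current best unless the comparator favors the challenger.
        if not prefer(best, challenger):
            best = challenger
    return best


def toy_prefer(a: str, b: str) -> bool:
    """Placeholder judge: favors click actions; ties go to the first action."""
    return a.startswith("click") or not b.startswith("click")


candidates = ["scroll(down)", "click('Add to cart')", "type('search', 'mouse')"]
print(best_of_n(candidates, toy_prefer))
```

A knockout needs only N - 1 comparisons, but its result can depend on candidate order if the judge's preferences are not transitive; a round-robin over all pairs is a costlier alternative when that matters.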