WebArbiter-8B-Qwen3 by ZYao720 is an 8 billion parameter reasoning Process Reward Model (PRM) built on Qwen3-8B, designed for web agents. It excels at evaluating web agent actions by generating structured, principle-guided justifications and preference verdicts, achieving the highest Avg. BoN Acc of 76.66% among WebArbiter variants. This model is primarily used to guide web agent trajectory search and provide interpretable feedback on action choices in web environments.
WebArbiter-8B-Qwen3: A Principle-Guided Reasoning Process Reward Model
WebArbiter-8B-Qwen3, developed by ZYao720, is an 8 billion parameter Process Reward Model (PRM) for web agents, built on Qwen3-8B. Instead of emitting a bare scalar reward, it generates a structured, interpretable justification for each action preference. It achieves an Avg. BoN Acc of 76.66% on WebPRMBench, the highest of all WebArbiter variants and above its Qwen2.5-based predecessor.
Key Capabilities
- Strongest Performance: Achieves 76.66% Avg. BoN Acc on WebPRMBench, the highest among WebArbiter variants.
- Reasoning as Reward: Generates auditable reasoning chains through structured outputs including <State>, <Criteria>, <Analysis>, and <Answer>.
- Principle-Inducing Evaluation: Dynamically derives evaluation principles based on user intent and the current page state.
- Two-Stage Training: Uses a two-stage training pipeline of reasoning distillation (SFT) and RL with Verifiable Rewards (GRPO).
- Cross-Backbone Generalization: The training pipeline is reported to generalize across backbone models, including Qwen2.5 and Qwen3.
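To illustrate how the structured output listed above might be consumed downstream, here is a minimal Python sketch that splits a judgment into its four sections. It assumes each section is wrapped in a matching closing tag (for example `<Answer>...</Answer>`), which the card does not state. The `parse_judgment` helper and the sample text are hypothetical, not part of the model's tooling.

```python
import re

# Section names taken from the model card; the paired closing tags are an assumption.
SECTION_TAGS = ("State", "Criteria", "Analysis", "Answer")


def parse_judgment(text: str) -> dict:
    """Return a dict of lowercase section name -> content (None if the section is absent)."""
    sections = {}
    for tag in SECTION_TAGS:
        # Non-greedy match so each section stops at its own closing tag.
        match = re.search(rf"<{tag}>(.*?)</{tag}>", text, flags=re.DOTALL)
        sections[tag.lower()] = match.group(1).strip() if match else None
    return sections


# Hand-written sample for illustration only, not real model output.
sample = (
    "<State>Search results page for a usb-c cable.</State>\n"
    "<Criteria>The action should move toward adding the item to the cart.</Criteria>\n"
    "<Analysis>Action 1 opens the product page; Action 2 leaves the site.</Analysis>\n"
    "<Answer>Action 1</Answer>"
)

parsed = parse_judgment(sample)
print(parsed["answer"])
```

Returning `None` for a missing section, rather than raising, lets the caller decide whether an incomplete judgment should be retried or discarded.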
Intended Uses
- Evaluating Web Agent Actions: Determines the best action among candidates given a web state and user task.
- Guiding Trajectory Search: Provides the reward signal for search-based web agent execution strategies such as Best-of-N sampling.
- Interpretable Feedback: Offers clear, structured explanations for action preferences, enhancing transparency and debuggability.
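One way a pairwise preference verdict could drive Best-of-N selection is a sequential knockout over the candidate actions, sketched below in Python. The `judge` callable is a hypothetical stand-in for a call to the reward model (its signature is an assumption), and `toy_judge` is a word-overlap heuristic used only so the example runs without the model.

```python
from typing import Callable, Sequence

# Hypothetical judge signature: (task, page_state, action_a, action_b) -> 0 if
# action_a is preferred, 1 if action_b is preferred.
Judge = Callable[[str, str, str, str], int]


def best_of_n(task: str, state: str, candidates: Sequence[str], judge: Judge) -> str:
    """Keep a running winner and compare it against each remaining candidate."""
    if not candidates:
        raise ValueError("need at least one candidate action")
    best = candidates[0]
    for challenger in candidates[1:]:
        if judge(task, state, best, challenger) == 1:
            best = challenger
    return best


def toy_judge(task: str, state: str, action_a: str, action_b: str) -> int:
    """Toy stand-in for the reward model: prefer the action sharing more words with the task."""
    task_words = set(task.lower().split())

    def overlap(action: str) -> int:
        return len(task_words & set(action.lower().split()))

    return 0 if overlap(action_a) >= overlap(action_b) else 1


task = "add the usb-c cable to cart"
state = "product page"  # placeholder page description
candidates = ["click home link", "click add to cart button", "scroll down"]

chosen = best_of_n(task, state, candidates, toy_judge)
print(chosen)
```

The knockout needs N-1 judge calls for N candidates. A round-robin over all pairs would be less sensitive to candidate order, at the cost of more calls.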