ZYao720/WebArbiter-8B-Qwen3
TEXT GENERATION · Concurrency Cost: 1 · Model Size: 8B · Quant: FP8 · Ctx Length: 32k · Published: Apr 8, 2026 · License: apache-2.0 · Architecture: Transformer · Open Weights

WebArbiter-8B-Qwen3 by ZYao720 is an 8-billion-parameter reasoning Process Reward Model (PRM) built on Qwen3-8B and designed for web agents. It evaluates web agent actions by generating structured, principle-guided justifications and preference verdicts, achieving the highest Avg. BoN Acc of 76.66% among WebArbiter variants. Its primary uses are guiding web agent trajectory search and providing interpretable feedback on action choices in web environments.


WebArbiter-8B-Qwen3: A Principle-Guided Reasoning Process Reward Model

WebArbiter-8B-Qwen3, developed by ZYao720, is an 8-billion-parameter Process Reward Model (PRM) for web agents, based on the Qwen3-8B architecture. It is distinguished by its ability to generate structured, interpretable justifications for action preferences rather than simple scalar rewards. The model achieved the highest Avg. BoN Acc, 76.66%, across all WebArbiter variants on WebPRMBench, outperforming its Qwen2.5-based predecessor.
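The card does not publish the exact input schema the model expects; as a rough illustration only, a pairwise-comparison prompt for a PRM of this kind might be assembled as below. The field names and template are assumptions, not the official WebArbiter format.

```python
def build_prm_prompt(task, page_state, action_a, action_b):
    """Assemble a hypothetical pairwise-preference prompt for a web-agent PRM.

    The requested sections mirror the structured output tags the model card
    describes (<State>, <Criteria>, <Analysis>, <Answer>), but the exact
    input format used by WebArbiter-8B-Qwen3 is an assumption here.
    """
    return (
        f"User task: {task}\n"
        f"Current page state:\n{page_state}\n\n"
        f"Candidate action A: {action_a}\n"
        f"Candidate action B: {action_b}\n\n"
        "Compare the candidates and respond with <State>, <Criteria>, "
        "<Analysis>, and a final <Answer>A</Answer> or <Answer>B</Answer>."
    )

prompt = build_prm_prompt(
    task="Book the cheapest flight to Tokyo",
    page_state="Search results page listing 12 flights sorted by departure time",
    action_a="click(sort_by_price)",
    action_b="click(first_result)",
)
```

The resulting string would then be sent to the model through whatever serving stack hosts it; the point of the sketch is only the shape of the comparison query.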

Key Capabilities

  • Strongest Performance: Achieves 76.66% Avg. BoN Acc, demonstrating superior performance in evaluating web agent actions.
  • Reasoning as Reward: Generates auditable reasoning chains through structured outputs including <State>, <Criteria>, <Analysis>, and <Answer>.
  • Principle-Inducing Evaluation: Dynamically derives evaluation principles based on user intent and the current page state.
  • Two-Stage Training: Trained via a two-stage pipeline of reasoning distillation (SFT) followed by reinforcement learning with verifiable rewards (GRPO).
  • Cross-Backbone Generalization: The training pipeline is proven to generalize across different backbone models, including Qwen2.5 and Qwen3.
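The structured output sections listed above lend themselves to simple post-hoc parsing. A minimal sketch using the standard library, assuming each tag appears at most once and is not nested (the sample reply is invented for illustration):

```python
import re

def parse_prm_output(text):
    """Extract the <State>, <Criteria>, <Analysis>, and <Answer> sections
    from a PRM response. Assumes well-formed, non-nested tags; returns a
    dict mapping section name to stripped content."""
    sections = {}
    for tag in ("State", "Criteria", "Analysis", "Answer"):
        m = re.search(rf"<{tag}>(.*?)</{tag}>", text, re.DOTALL)
        if m:
            sections[tag] = m.group(1).strip()
    return sections

# Invented example reply in the documented section format.
reply = (
    "<State>Search results page</State>"
    "<Criteria>Prefer actions that progress toward the cheapest flight</Criteria>"
    "<Analysis>Sorting by price surfaces the cheapest option directly.</Analysis>"
    "<Answer>A</Answer>"
)
verdict = parse_prm_output(reply)["Answer"]  # → "A"
```

Keeping the justification sections alongside the final verdict is what makes the reward auditable: a failed trajectory can be traced back to the criteria the PRM applied.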

Intended Uses

  • Evaluating Web Agent Actions: Determines the best action among candidates given a web state and user task.
  • Guiding Trajectory Search: Provides the reward signal for search-based web agent execution strategies such as Best-of-N sampling.
  • Interpretable Feedback: Offers clear, structured explanations for action preferences, enhancing transparency and debuggability.
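To make the Best-of-N use case concrete, here is a minimal sketch of how a PRM score plugs into candidate selection. The scoring dictionary is a toy stand-in; a real setup would query WebArbiter-8B-Qwen3 for each candidate action.

```python
def best_of_n(candidates, score_fn):
    """Best-of-N selection: score each sampled candidate action and keep
    the highest-scoring one. `score_fn` stands in for a real PRM
    inference call, which this sketch does not make."""
    return max(candidates, key=score_fn)

# Toy stand-in scores for three sampled actions on a flight-search page.
toy_scores = {
    "click(sort_by_price)": 0.91,
    "click(first_result)": 0.42,
    "scroll(down)": 0.17,
}
best = best_of_n(list(toy_scores), toy_scores.get)  # → "click(sort_by_price)"
```

In practice the agent samples N candidate actions from its policy, scores each with the PRM, and executes the winner, so selection quality tracks the PRM's Avg. BoN Acc.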