ZYao720/WebArbiter-4B-Qwen3

Text generation | Concurrency cost: 1 | Model size: 4B | Quantization: BF16 | Context length: 32k | Published: Apr 8, 2026 | License: apache-2.0 | Architecture: Transformer | Open weights

WebArbiter-4B-Qwen3 is a 4-billion-parameter reasoning Process Reward Model (PRM) for web agents, developed by ZYao720 and built on Qwen3-4B. It formulates step-level reward modeling as structured text generation, producing interpretable, principle-inducing justifications for web agent actions. It reaches an average Best-of-N accuracy (Avg. BoN Acc) of 72.55% on WebPRMBench, approaching the performance of 7B-parameter models at roughly half the size.

WebArbiter-4B-Qwen3: A Principle-Guided Reasoning Process Reward Model

WebArbiter-4B-Qwen3, developed by ZYao720, is a 4-billion-parameter reasoning Process Reward Model (PRM) designed for web agents. Built on Qwen3-4B, it evaluates web agent actions by generating structured, interpretable justifications.

Key Capabilities

  • Parameter-efficient performance: Achieves an Avg. BoN Acc (average Best-of-N accuracy) of 72.55% on WebPRMBench, approaching the performance of 7B-parameter models at roughly half the size.
  • Reasoning as reward: Formulates step-level reward modeling as structured text generation, producing auditable reasoning chains with <State>, <Criteria>, <Analysis>, and <Answer> outputs.
  • Principle-inducing evaluation: Dynamically derives evaluation principles based on user intent and the current page state.
  • Two-stage training: Supervised fine-tuning (SFT) on reasoning distilled from an o3 teacher, followed by Reinforcement Learning with Verifiable Rewards (RLVR) using GRPO.
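
The structured output described above can be split into its four sections with a few lines of Python. This is a minimal sketch, not the author's tooling: it assumes each section is wrapped in paired tags such as <State>...</State>, and the sample text and the function name parse_judgment are hypothetical illustrations rather than real model output.

```python
import re

# Section names taken from the model card; paired closing tags are an assumption.
TAGS = ("State", "Criteria", "Analysis", "Answer")


def parse_judgment(text: str) -> dict:
    """Extract each tagged section; sections that are missing map to None."""
    sections = {}
    for tag in TAGS:
        # DOTALL lets a section span several lines; .*? stops at the first closing tag.
        match = re.search(rf"<{tag}>(.*?)</{tag}>", text, flags=re.DOTALL)
        sections[tag] = match.group(1).strip() if match else None
    return sections


# Hypothetical sample written for illustration only.
sample = (
    "<State>Search results page for 'usb-c cable'.</State>\n"
    "<Criteria>Prefer actions that move toward checkout.</Criteria>\n"
    "<Analysis>Action 1 opens the product; Action 2 scrolls away.</Analysis>\n"
    "<Answer>Action 1</Answer>"
)
parsed = parse_judgment(sample)
print(parsed["Answer"])
```

Returning None for a missing section, instead of raising, makes it easy to log or discard malformed generations before the <Answer> value is used as a reward signal.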

Good For

  • Evaluating web agent actions: Determines which of two candidate actions better advances a user's task given a web state.
  • Guiding trajectory search: Provides a robust reward signal for techniques like Best-of-N sampling or tree search in web agent execution.
  • Interpretable feedback: Generates clear, structured explanations for action preferences, enhancing transparency and debugging.
  • Resource-efficient deployment: Offers strong performance at a 4B parameter scale, making it suitable for environments with computational constraints.
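
Because the model compares two candidate actions at a time, one way to use it for Best-of-N selection is a sequential knockout over the N candidates. The sketch below shows only that selection loop, under stated assumptions: the judge signature (returning 0 or 1) is hypothetical, and stub_judge is a toy word-overlap heuristic standing in for an actual call to the model.

```python
import re
from typing import Callable, Sequence

# Hypothetical judge signature: (task, state, action_a, action_b) -> 0 if a is preferred, 1 if b.
Judge = Callable[[str, str, str, str], int]


def best_of_n(task: str, state: str, candidates: Sequence[str], judge: Judge) -> str:
    """Keep a running champion and replace it whenever the judge prefers the challenger."""
    if not candidates:
        raise ValueError("need at least one candidate action")
    champion = candidates[0]
    for challenger in candidates[1:]:
        if judge(task, state, champion, challenger) == 1:
            champion = challenger
    return champion


def _tokens(text: str) -> set:
    return set(re.findall(r"[a-z0-9-]+", text.lower()))


def stub_judge(task: str, state: str, action_a: str, action_b: str) -> int:
    """Toy stand-in for the reward model: prefers the action sharing more words with the task."""
    task_words = _tokens(task)
    score_a = len(task_words & _tokens(action_a))
    score_b = len(task_words & _tokens(action_b))
    return 1 if score_b > score_a else 0


task = "add usb-c cable to cart"
state = "product page"  # placeholder page description
candidates = ["scroll down", "click 'Add to cart'", "open help page"]
best = best_of_n(task, state, candidates, stub_judge)
print(best)
```

This loop needs N - 1 pairwise comparisons. A round-robin with vote counting is an alternative that is less sensitive to candidate order, at the cost of more judge calls.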