Name: ZYao720/WebArbiter-4B-Qwen3 API
Brand: Featherless.ai
Price: 10.00 USD
Availability: InStock
Author: ZYao720

WebArbiter-4B-Qwen3: A Principle-Guided Reasoning Process Reward Model

WebArbiter-4B-Qwen3, developed by ZYao720, is a 4-billion-parameter reasoning Process Reward Model (PRM) designed for web agents. Built on the Qwen3-4B architecture, it evaluates web agent actions by generating structured, interpretable justifications.

Key Capabilities

Parameter-efficient performance: Achieves an average Best-of-N (BoN) accuracy of 72.55% on WebPRMBench, approaching the performance of 7B-parameter models at roughly half the size.
Reasoning as reward: Formulates step-level reward modeling as structured text generation, producing auditable reasoning chains with <State>, <Criteria>, <Analysis>, and <Answer> outputs.
Principle-inducing evaluation: Dynamically derives evaluation principles based on user intent and the current page state.
Two-stage training: Uses supervised fine-tuning (SFT) on reasoning distilled from an o3 teacher, followed by Reinforcement Learning with Verifiable Rewards (RLVR) using GRPO.
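
Because the reward is emitted as text, downstream code has to extract the verdict from the generated reasoning chain. The following is a minimal sketch of that step, not the model's official parser. It assumes the four sections are wrapped in matching open/close tags (for example <Answer>...</Answer>) and that the answer is a short label such as "1" or "2"; the listing above only names the tags, so both the closing-tag form and the sample text are assumptions. The helper name parse_judgment is hypothetical.

```python
import re

# Section names taken from the capability list above.
TAGS = ("State", "Criteria", "Analysis", "Answer")


def parse_judgment(text: str) -> dict:
    """Map each section name to its stripped content, or None if the section is missing.

    Assumes sections look like <Tag>...</Tag>; this closing-tag form is an assumption.
    """
    fields = {}
    for tag in TAGS:
        # Non-greedy match with DOTALL so a section may span several lines.
        match = re.search(rf"<{tag}>(.*?)</{tag}>", text, flags=re.DOTALL)
        fields[tag] = match.group(1).strip() if match else None
    return fields


# Invented sample output, for illustration only.
sample = (
    "<State>Search results page for 'laptop'.</State>\n"
    "<Criteria>The action should narrow results toward the user's budget.</Criteria>\n"
    "<Analysis>Action 1 applies the price filter; Action 2 opens an unrelated ad.</Analysis>\n"
    "<Answer>1</Answer>"
)
parsed = parse_judgment(sample)
```

Returning None for a missing section, rather than raising, lets a caller decide whether to retry generation or discard the comparison when the output is malformed.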

Good For

Evaluating web agent actions: Determines which of two candidate actions better advances a user's task given a web state.
Guiding trajectory search: Provides a robust reward signal for techniques like Best-of-N sampling or tree search in web agent execution.
Interpretable feedback: Generates clear, structured explanations for action preferences, enhancing transparency and debugging.
Resource-efficient deployment: Offers strong performance at a 4B parameter scale, making it suitable for environments with computational constraints.
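
One way a pairwise judge can drive Best-of-N selection is a sequential knockout: keep a current best action and let each remaining candidate challenge it. The sketch below illustrates that idea under stated assumptions and is not the authors' search procedure. The judge signature (intent, state, action_a, action_b) returning 0 or 1 is hypothetical, and toy_judge is a word-overlap stand-in for a real call to the model.

```python
from typing import Callable, Sequence

# Hypothetical judge signature: returns 0 if the first action is preferred, 1 if the second is.
Judge = Callable[[str, str, str, str], int]


def best_of_n(intent: str, state: str, candidates: Sequence[str], judge: Judge) -> str:
    """Pick one action from N candidates using N-1 pairwise comparisons."""
    if not candidates:
        raise ValueError("need at least one candidate action")
    best = candidates[0]
    for challenger in candidates[1:]:
        # The challenger replaces the current best only if the judge prefers it.
        if judge(intent, state, best, challenger) == 1:
            best = challenger
    return best


def toy_judge(intent: str, state: str, action_a: str, action_b: str) -> int:
    """Stand-in for the reward model: prefers the action sharing more words with the intent."""
    intent_words = set(intent.lower().split())

    def overlap(action: str) -> int:
        return len(intent_words & set(action.lower().split()))

    return 1 if overlap(action_b) > overlap(action_a) else 0


chosen = best_of_n(
    intent="filter laptops by price",
    state="search results page",
    candidates=["click ad banner", "click price filter", "scroll down"],
    judge=toy_judge,
)
```

A knockout needs only N-1 judge calls but implicitly assumes the judge's preferences are roughly transitive; a round-robin with vote counting is a costlier alternative when that assumption is doubtful.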
