Name: ZYao720/WebArbiter-8B-Qwen3 API
Brand: Featherless.ai
Price: 10.00 USD
Availability: InStock
Author: ZYao720

WebArbiter-8B-Qwen3: A Principle-Guided Reasoning Process Reward Model

WebArbiter-8B-Qwen3, developed by ZYao720, is an 8-billion-parameter Process Reward Model (PRM) for web agents, built on Qwen3-8B. Rather than emitting a single scalar reward, it generates structured, interpretable justifications for its action preferences. It achieved an Avg. BoN Acc of 76.66% on WebPRMBench, the highest among all WebArbiter variants, and outperforms its Qwen2.5-based predecessor.

Key Capabilities

Strongest Performance: Achieves 76.66% Avg. BoN Acc on WebPRMBench, the highest among all WebArbiter variants.
Reasoning as Reward: Generates auditable reasoning chains through structured outputs including <State>, <Criteria>, <Analysis>, and <Answer>.
Principle-Inducing Evaluation: Dynamically derives evaluation principles based on user intent and the current page state.
Two-Stage Training: Trained in two stages, reasoning distillation (SFT) and RL with Verifiable Rewards (GRPO).
Cross-Backbone Generalization: The training pipeline generalizes across backbone models, including Qwen2.5 and Qwen3.
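
The structured output described under "Reasoning as Reward" lends itself to simple post-processing. The following is a minimal sketch in Python, not the author's tooling: the four tag names come from the list above, while the closing-tag form (for example </State>), the helper name parse_judgment, and the sample text are assumptions made for illustration.

```python
import re

# Tag names are taken from the model card. The closing-tag form and the
# sample judgment below are assumptions for illustration only.
SECTION_TAGS = ("State", "Criteria", "Analysis", "Answer")


def parse_judgment(text: str) -> dict:
    """Extract each tagged section from a judgment string.

    Sections that are missing come back as empty strings, so callers can
    detect malformed outputs without handling exceptions.
    """
    sections = {}
    for tag in SECTION_TAGS:
        match = re.search(rf"<{tag}>(.*?)</{tag}>", text, flags=re.DOTALL)
        sections[tag] = match.group(1).strip() if match else ""
    return sections


# Hypothetical judgment text; real model output may be formatted differently.
sample = """
<State>The search results page for "red shoes" is shown.</State>
<Criteria>The action should move toward the requested product.</Criteria>
<Analysis>Action 1 opens a matching product; Action 2 leaves the page.</Analysis>
<Answer>Action 1</Answer>
"""

parsed = parse_judgment(sample)
print(parsed["Answer"])
```

Returning empty strings for missing sections is a design choice that keeps the parser tolerant of truncated generations; a stricter variant could raise an error instead.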

Intended Uses

Evaluating Web Agent Actions: Determines the best action among candidates given a web state and user task.
Guiding Trajectory Search: Provides the reward signal for search-based web agent execution strategies such as Best-of-N sampling.
Interpretable Feedback: Offers clear, structured explanations for action preferences, which makes agent decisions easier to audit and debug.
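
To make the Best-of-N use case concrete, here is a minimal Python sketch of how a preference judge could select one action from N candidates. It assumes a pairwise interface, prefer(task, state, a, b), which is hypothetical and not the model's documented API; toy_prefer is a word-overlap stand-in for a real call to the reward model.

```python
from typing import Callable, Sequence

# Hypothetical judge interface: returns True when action `a` is preferred
# over action `b` for the given task and page state.
Prefer = Callable[[str, str, str, str], bool]


def best_of_n(task: str, state: str, candidates: Sequence[str], prefer: Prefer) -> str:
    """Pick one action by comparing the current best against each challenger."""
    if not candidates:
        raise ValueError("need at least one candidate action")
    best = candidates[0]
    for challenger in candidates[1:]:
        if not prefer(task, state, best, challenger):
            best = challenger
    return best


def toy_prefer(task: str, state: str, a: str, b: str) -> bool:
    """Stand-in judge: prefers the action sharing more words with the task."""
    words = set(task.lower().split())

    def overlap(action: str) -> int:
        return len(words & set(action.lower().split()))

    return overlap(a) >= overlap(b)


chosen = best_of_n(
    task="search for red shoes",
    state="home page with a search box",
    candidates=[
        "click about link",
        "type red shoes into search box",
        "scroll down",
    ],
    prefer=toy_prefer,
)
print(chosen)
```

This sequential comparison uses N-1 judge calls. If the judge can rank all candidates in a single call, that would replace the loop entirely.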
