ToolRM-Gen-Qwen3-4B-Thinking-2507: Agentic Tool-Use Reward Model
This model is part of the ToolRM family, a suite of lightweight generative and discriminative reward models engineered specifically for agentic tool use. Developed by RioLee, this 4-billion-parameter model is built on the Qwen3 architecture and is designed to evaluate and critique AI assistant performance in scenarios involving tool utilization.
Key Capabilities
- Pairwise Critique: Conducts thorough comparisons between two generated assistant responses and selects the superior one based on specific evidence and evaluation criteria.
- Pointwise Critique: Provides concise critiques on how a single assistant response should be revised, or identifies it as correct.
- Best-of-N Critique: Evaluates multiple assistant responses and selects the best one.
- Tool-Use Evaluation: Specializes in assessing whether available tools are used appropriately and completely, validating tool calls and their arguments, and penalizing fabricated or repetitive actions.
- Reinforcement Learning Support: Provides verifiable feedback signals suitable for downstream RL training.
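As a concrete illustration of the pairwise-critique setup, the sketch below assembles a chat-style prompt that presents the tool definitions, the conversation, and two candidate responses to the judge. The instruction wording, message layout, and the `get_weather` tool are all illustrative assumptions, not the exact template shipped with ToolRM.

```python
import json

def build_pairwise_prompt(tools, conversation, response_a, response_b):
    """Assemble a chat-style message list asking the reward model to pick
    the better of two candidate assistant responses. The wording here is
    illustrative, not the official ToolRM prompt template."""
    system = (
        "You are a reward model for agentic tool use. Compare Response A and "
        "Response B against the conversation and the available tools, cite "
        "specific evidence, and end with your verdict: 'A' or 'B'.\n\n"
        f"Available tools:\n{json.dumps(tools, indent=2)}"
    )
    user = (
        "Conversation so far:\n"
        + "\n".join(f"{m['role']}: {m['content']}" for m in conversation)
        + f"\n\nResponse A:\n{response_a}\n\nResponse B:\n{response_b}"
    )
    return [{"role": "system", "content": system},
            {"role": "user", "content": user}]

# Example with a single hypothetical weather tool.
tools = [{"name": "get_weather", "parameters": {"city": "string"}}]
conversation = [{"role": "user", "content": "What's the weather in Oslo?"}]
messages = build_pairwise_prompt(
    tools, conversation,
    response_a='<tool_call>{"name": "get_weather", "arguments": {"city": "Oslo"}}</tool_call>',
    response_b="It is probably sunny.",  # fabricated answer, no tool call
)
```

The resulting `messages` list can then be passed through the model's chat template and generated from as usual; the model's critique should end with a verdict naming the better response.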
Unique Approach
ToolRM models are trained on the novel ToolPref-Pairwise-30K dataset, constructed with a pipeline of rule-based scoring and multidimensional sampling. Evaluation is performed on TRBench-BFCL, a benchmark built on the agentic evaluation suite BFCL. With its 40,960-token context length, this model has demonstrated superior performance in pairwise reward judgments compared to several larger LLMs.
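To make the idea of rule-based scoring concrete, here is a toy scorer for a single tool call, loosely in the spirit of the pipeline described above. The specific rules, weights, and the example tools are invented for illustration and are not the actual ToolPref-Pairwise-30K scoring rules.

```python
import json

def score_tool_call(call_json, available_tools, seen_calls):
    """Toy rule-based scorer: rewards a well-formed, novel call to a real
    tool with complete arguments; penalizes malformed JSON, fabricated
    tool names, missing required arguments, and repeated calls."""
    try:
        call = json.loads(call_json)
    except json.JSONDecodeError:
        return -1.0  # malformed call
    tool = next((t for t in available_tools if t["name"] == call.get("name")), None)
    if tool is None:
        return -1.0  # fabricated tool name
    required = set(tool.get("required", []))
    if not required <= set(call.get("arguments", {})):
        return -0.5  # missing required arguments
    if call_json in seen_calls:
        return -0.5  # repetitive action
    seen_calls.add(call_json)
    return 1.0  # valid, novel call with complete arguments

tools = [{"name": "get_weather", "required": ["city"]}]
seen = set()
score1 = score_tool_call('{"name": "get_weather", "arguments": {"city": "Oslo"}}', tools, seen)
score2 = score_tool_call('{"name": "get_weather", "arguments": {"city": "Oslo"}}', tools, seen)  # repeat
score3 = score_tool_call('{"name": "book_flight", "arguments": {}}', tools, seen)  # fabricated tool
```

Scores like these, aggregated over a trajectory, are one way such a pipeline could rank candidate responses before sampling preference pairs.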
Usage Notes
- The model was trained with a maximum input length of 16,384 tokens; longer prompts may lead to unpredictable behavior.
- For pairwise evaluation, running the judgment twice with the order of the two assistant responses swapped is recommended to mitigate position bias.
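The order-swapping note above can be sketched as a simple two-pass wrapper around any pairwise judge: query once in each order and keep the verdict only when the two passes agree. This is one common debiasing protocol, offered under the assumption that the judge returns a bare "A"/"B" verdict; the exact protocol used with ToolRM may differ.

```python
def debiased_pairwise_judgment(judge, response_a, response_b):
    """Run a pairwise judge twice with the candidate order swapped.
    `judge(first, second)` is any callable returning "A" (prefers the
    first slot) or "B" (prefers the second slot)."""
    first = judge(response_a, response_b)   # response_a in the first slot
    second = judge(response_b, response_a)  # response_a in the second slot
    # Map the second pass back to the original labels.
    second_mapped = "A" if second == "B" else "B"
    if first == second_mapped:
        return first
    return "tie"  # preference flipped with position: inconclusive

# Stub judge that always prefers the longer response (for demonstration).
length_judge = lambda x, y: "A" if len(x) >= len(y) else "B"
verdict = debiased_pairwise_judgment(length_judge, "short", "a much longer response")

# A fully position-biased stub (always picks the first slot) yields a tie.
biased_judge = lambda x, y: "A"
tie_verdict = debiased_pairwise_judgment(biased_judge, "one", "two")
```

Disagreements between the two passes can be treated as ties, discarded, or resolved with an additional sample, depending on the evaluation budget.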