RioLee/ToolRM-Gen-Qwen3-4B-Thinking-2507
RioLee/ToolRM-Gen-Qwen3-4B-Thinking-2507 is a 4-billion-parameter generative reward model from the Qwen3 family, developed by RioLee and designed specifically for agentic tool-use scenarios. It excels at pairwise reward judgments and broader critique tasks such as Best-of-N sampling and self-correction, outperforming larger LLMs on these specialized evaluations. With a 40,960-token context length, the model is optimized for evaluating and improving AI assistant performance in complex tool-use conversations.
# ToolRM-Gen-Qwen3-4B-Thinking-2507: Agentic Tool-Use Reward Model
This model is part of the ToolRM family, a suite of lightweight generative and discriminative reward models engineered specifically for agentic tool use. Developed by RioLee, this 4-billion-parameter model is built on the Qwen3 architecture and is designed to evaluate and critique AI assistant performance in scenarios involving tool utilization.
## Key Capabilities
- Pairwise Critique: Conducts thorough comparisons between two generated assistant responses, making a clear choice of the superior option based on specific evidence and evaluation criteria.
- Pointwise Critique: Provides concise critiques on how a single assistant response should be revised, or identifies it as correct.
- Best-of-N Critique: Evaluates multiple assistant responses and selects the best one.
- Tool-Use Evaluation: Assesses whether available tools are used appropriately and completely, validates tool calls and their arguments, and penalizes fabricated or repetitive actions.
- Reinforcement Learning Support: Provides verifiable feedback signals that can drive downstream RL training.
## Unique Approach
ToolRM models are trained on the novel ToolPref-Pairwise-30K dataset, constructed with a pipeline of rule-based scoring and multidimensional sampling. Evaluation is performed on TRBench-BFCL, a benchmark built on the agentic evaluation suite BFCL. With its 40,960-token context length, this model has demonstrated superior performance in pairwise reward judgments compared to several larger LLMs.
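One way to use such pairwise judgments for Best-of-N sampling is a simple tournament over the candidates. This is a sketch under stated assumptions: `judge` is a hypothetical placeholder for a call to the reward model that returns `"A"` or `"B"` for whichever of its two arguments is preferred.

```python
# Sketch of Best-of-N selection driven by pairwise reward judgments.
# `judge(a, b)` is a placeholder returning "A" if the first candidate
# is preferred and "B" otherwise; the reward model would back it in practice.
from typing import Callable

def best_of_n(candidates: list[str], judge: Callable[[str, str], str]) -> str:
    """Keep the running winner of successive pairwise judgments."""
    best = candidates[0]
    for challenger in candidates[1:]:
        if judge(best, challenger) == "B":  # "B" = second argument preferred
            best = challenger
    return best
```

This sequential tournament needs only N-1 judge calls, at the cost of depending on the comparison order; a full round-robin is more robust but quadratic in N.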
## Usage Notes
- The model was trained with a maximum input length of 16,384 tokens; longer prompts may lead to unpredictable behavior.
- Swapping the order of assistant responses during evaluation is recommended to mitigate position bias.