Model Overview
Jarrodbarnes/Qwen3-4B-tau2-grpo-v1 is a 4-billion-parameter model built on the Qwen3-4B-Instruct base and optimized specifically for multi-turn tool-use tasks. It achieves 59.0% Pass@4 on the challenging tau2-bench test split, roughly a 4x improvement over the base model's agentic performance.
Key Capabilities & Training
The model's performance stems from a progressive three-stage training pipeline:
- SFT (Supervised Fine-Tuning): Initial learning of tool schemas and interaction protocols.
- RFT (Rejection Fine-Tuning): Further training on sampled trajectories filtered to keep only high-quality interactions.
- GRPO (Group Relative Policy Optimization): Reinforcement learning with turn-level reward shaping for complex multi-step reasoning.
This methodology enables the model to effectively handle sequential function calls and complex agent workflows, as detailed in the tau2 training cookbook.
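The rejection fine-tuning stage can be illustrated with a minimal sketch. The trajectory format and the 0.9 reward threshold below are illustrative assumptions, not the actual training code:

```python
# Minimal sketch of rejection fine-tuning (RFT) data selection:
# sample rollouts, keep only high-reward ones, then fine-tune on them.
# The reward threshold and data shapes are illustrative assumptions.

def select_rft_data(trajectories, threshold=0.9):
    """Keep only high-reward trajectories for supervised fine-tuning."""
    return [t for t in trajectories if t["reward"] >= threshold]

# Example: sampled rollouts with scalar task rewards.
rollouts = [
    {"messages": ["..."], "reward": 1.0},   # task solved
    {"messages": ["..."], "reward": 0.3},   # partial / failed attempt
    {"messages": ["..."], "reward": 0.95},  # solved with minor issues
]

kept = select_rft_data(rollouts)
print(len(kept))  # 2 of the 3 rollouts pass the filter
```

The surviving trajectories then serve as the supervised dataset for the next fine-tuning pass, before GRPO refines the policy with turn-level rewards.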
Performance Highlights
On the tau2-bench test split, the model achieves:
- Overall Pass@4: 59.0%
- Overall Pass@1: 36.0%
This significantly surpasses the baseline Qwen3-4B-Instruct, which scored 14.3% Pass@4, showcasing the effectiveness of the GRPO fine-tuning for agentic tasks.
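Pass@k is presumably computed with the standard unbiased combinatorial estimator (this is an assumption; the evaluation harness is not shown here). A sketch:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@k estimator: probability that at least one of k
    samples, drawn without replacement from n total (c correct), succeeds."""
    if n - c < k:
        return 1.0  # too few failures to fill k slots: guaranteed success
    return 1.0 - comb(n - c, k) / comb(n, k)

# With n = 4 samples per task, Pass@4 reduces to "any sample succeeded":
print(pass_at_k(4, 1, 4))  # 1.0
print(pass_at_k(4, 0, 4))  # 0.0
```

With k equal to the number of samples, the estimator degenerates to a simple any-success check; the combinatorial form matters when n > k.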
Use Cases
This model is particularly well-suited for applications requiring:
- Multi-turn function calling: Executing a sequence of tool interactions to complete a complex task.
- Agentic workflows: Building AI agents that can reason and act over multiple steps.
- Automated task completion: Handling structured interactions in domains like retail and airline services, though telecom tasks remain more challenging.
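A typical multi-turn function-calling loop looks like the sketch below. Here `call_model` and the tool registry are stand-ins for a real inference stack (e.g. a transformers or vLLM deployment of this model parsing structured tool calls):

```python
import json

# Hypothetical tool registry; real deployments expose tools via JSON schemas.
TOOLS = {
    "get_order_status": lambda order_id: {"order_id": order_id, "status": "shipped"},
}

def call_model(messages):
    """Stand-in for the model. A real agent would send `messages` (plus
    tool schemas) to the deployed model and parse its structured reply."""
    if not any(m["role"] == "tool" for m in messages):
        return {"tool_call": {"name": "get_order_status",
                              "arguments": {"order_id": "A123"}}}
    return {"content": "Your order A123 has shipped."}

def run_agent(user_message, max_turns=5):
    messages = [{"role": "user", "content": user_message}]
    for _ in range(max_turns):
        reply = call_model(messages)
        if "tool_call" in reply:                 # model wants to act
            call = reply["tool_call"]
            result = TOOLS[call["name"]](**call["arguments"])
            messages.append({"role": "tool", "content": json.dumps(result)})
        else:                                    # model answers the user
            return reply["content"]
    return "Turn limit reached."

print(run_agent("Where is order A123?"))  # Your order A123 has shipped.
```

The loop alternates model calls and tool executions until the model produces a final answer, which is exactly the sequential pattern the training pipeline optimizes for.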
Limitations
Users should note weaker performance in the Telecom domain (40% Pass@4) and sensitivity of results to the user simulator used during evaluation. The reported Pass@k metric also differs from the Pass^k used on the official tau2-bench leaderboard.
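The metric distinction matters: Pass@k rewards any success among k trials, while Pass^k requires all k trials to succeed, so it penalizes inconsistency. Under a simple i.i.d. Bernoulli assumption (an illustrative model, not the official leaderboard scoring), the two diverge sharply:

```python
def pass_at_k(p: float, k: int) -> float:
    """At least one of k i.i.d. trials succeeds."""
    return 1.0 - (1.0 - p) ** k

def pass_hat_k(p: float, k: int) -> float:
    """All k i.i.d. trials succeed (a consistency metric)."""
    return p ** k

# Using this model's 36% Pass@1 as the per-trial success rate:
p = 0.36
print(round(pass_at_k(p, 4), 3))   # 0.832
print(round(pass_hat_k(p, 4), 3))  # 0.017
```

That the measured Pass@4 (59%) falls well below the i.i.d. prediction of 83% suggests per-task difficulty varies widely: some tasks fail on all four attempts, pulling the aggregate down.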