Jarrodbarnes/Qwen3-4B-tau2-grpo-v1
Jarrodbarnes/Qwen3-4B-tau2-grpo-v1 is a 4-billion-parameter Qwen3-based language model, fine-tuned specifically for multi-turn tool-use tasks. It achieves 59% Pass@4 on the tau2-bench test split, a significant improvement over its base model. The model excels at complex agentic workflows requiring sequential tool interactions, making it suitable for applications that need robust function-calling capabilities.
Model Overview
Jarrodbarnes/Qwen3-4B-tau2-grpo-v1 is a 4-billion-parameter model built on the Qwen3-4B-Instruct base, specifically optimized for multi-turn tool-use tasks. It demonstrates roughly a 4x improvement over the base model in agentic capabilities, achieving 59% Pass@4 on the challenging tau2-bench test split.
Key Capabilities & Training
This model's advanced performance stems from a progressive three-stage training pipeline:
- SFT (Supervised Fine-Tuning): Initial learning of tool schemas and interaction protocols.
- RFT (Rejection Fine-Tuning): Further training on self-generated trajectories filtered to keep only high-quality interactions.
- GRPO (Group Relative Policy Optimization): Reinforcement learning with turn-level reward shaping for complex multi-step reasoning.
This methodology enables the model to effectively handle sequential function calls and complex agent workflows, as detailed in the tau2 training cookbook.
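The GRPO stage's key idea is that advantages are computed relative to a group of sampled rollouts for the same prompt, rather than from a learned value function. A minimal sketch of that group-relative normalization, with illustrative names (the cookbook's actual reward shaping and loss are not reproduced here):

```python
# Sketch of GRPO's group-relative advantage computation.
# For one prompt, several completions are sampled and scored; each reward
# is then normalized against the group's mean and standard deviation.
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each rollout's reward within its group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Example: four rollouts of the same task, scored 1.0 (success) or 0.0.
advs = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
# Successful rollouts get positive advantage, failed ones negative.
```

In the turn-level variant described above, rewards would be assigned per turn rather than per episode, but the normalization idea is the same.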
Performance Highlights
On the tau2-bench test split, the model achieves:
- Overall Pass@4: 59.0%
- Overall Pass@1: 36.0%
This significantly surpasses the baseline Qwen3-4B-Instruct, which scored 14.3% Pass@4, showcasing the effectiveness of the GRPO fine-tuning for agentic tasks.
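Assuming Pass@k here follows the standard unbiased estimator (an assumption; the card does not state the exact formula), it estimates the probability that at least one of k sampled attempts succeeds, given c successes out of n samples:

```python
# Standard unbiased Pass@k estimator: 1 - C(n-c, k) / C(n, k),
# where n = samples drawn, c = correct samples, k = attempts allowed.
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k draws (without replacement)
    from n samples, c of which are correct, is correct."""
    if n - c < k:  # fewer failures than draws: guaranteed success
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 2 successes in 10 samples, 4 attempts allowed.
print(pass_at_k(10, 2, 4))  # → 0.666... (1 - C(8,4)/C(10,4))
```

Note that this per-attempt notion of Pass@k is distinct from the Pass^k reliability metric mentioned in the Limitations section.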
Use Cases
This model is particularly well-suited for applications requiring:
- Multi-turn function calling: Executing a sequence of tool interactions to complete a complex task.
- Agentic workflows: Building AI agents that can reason and act over multiple steps.
- Automated task completion: Handling structured interactions in domains like retail and airline services, though telecom tasks remain more challenging.
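For multi-turn function calling, the model consumes OpenAI-style message lists in which assistant turns may emit tool calls and tool results are fed back as `tool` messages. A sketch of that conversation structure, with a hypothetical `get_order_status` tool (not from the model card):

```python
import json

# Hypothetical tool schema for a retail-style task; names are illustrative.
tools = [{
    "type": "function",
    "function": {
        "name": "get_order_status",
        "description": "Look up the status of a retail order.",
        "parameters": {
            "type": "object",
            "properties": {"order_id": {"type": "string"}},
            "required": ["order_id"],
        },
    },
}]

# A two-turn exchange: the model first calls the tool, then answers
# using the tool's result.
messages = [
    {"role": "user", "content": "Where is order 1234?"},
    {   # assistant turn 1: a tool call instead of a text reply
        "role": "assistant",
        "tool_calls": [{
            "type": "function",
            "function": {
                "name": "get_order_status",
                "arguments": json.dumps({"order_id": "1234"}),
            },
        }],
    },
    {   # tool result returned to the model for the next turn
        "role": "tool",
        "content": json.dumps({"status": "shipped"}),
    },
    {"role": "assistant", "content": "Order 1234 has shipped."},
]
```

In practice these `messages` and `tools` would be passed to the tokenizer's chat template (e.g. `tokenizer.apply_chat_template(messages, tools=tools)`) before generation; the loop repeats, appending each tool call and result, until the model replies with plain text.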
Limitations
Users should note weaker performance in the Telecom domain (40% Pass@4) and sensitivity to the choice of user simulator during evaluation. Also, the reported Pass@k metric differs from the Pass^k metric used on the official tau2-bench leaderboard, so scores are not directly comparable.