Jarrodbarnes/Qwen3-4B-tau2-grpo-v1

Hugging Face
TEXT GENERATIONConcurrency Cost:1Model Size:4BQuant:BF16Ctx Length:32kPublished:Jan 16, 2026License:apache-2.0Architecture:Transformer0.0K Open Weights Warm

Jarrodbarnes/Qwen3-4B-tau2-grpo-v1 is a 4 billion parameter Qwen3-based language model, fine-tuned specifically for multi-turn tool-use tasks. It achieves 59% Pass@4 on the tau2-bench test split, representing a significant improvement over its base model. This model excels at complex agentic workflows requiring sequential tool interactions, making it suitable for applications needing robust function calling capabilities.

Loading preview...

Model Overview

Jarrodbarnes/Qwen3-4B-tau2-grpo-v1 is a 4 billion parameter model built on the Qwen3-4B-Instruct base, specifically optimized for multi-turn tool-use tasks. It demonstrates a 4x improvement over the base model in agentic capabilities, achieving 59% Pass@4 on the challenging tau2-bench test split.

Key Capabilities & Training

This model's advanced performance stems from a progressive three-stage training pipeline:

  • SFT (Supervised Fine-Tuning): Initial learning of tool schemas and interaction protocols.
  • RFT (Rejection Fine-Tuning): Focusing on high-quality interaction trajectories.
  • GRPO (Group Relative Policy Optimization): Reinforcement learning with turn-level reward shaping for complex multi-step reasoning.

This methodology enables the model to effectively handle sequential function calls and complex agent workflows, as detailed in the tau2 training cookbook.

Performance Highlights

On the tau2-bench test split, the model achieves:

  • Overall Pass@4: 59.0%
  • Overall Pass@1: 36.0%

This significantly surpasses the baseline Qwen3-4B-Instruct, which scored 14.3% Pass@4, showcasing the effectiveness of the GRPO fine-tuning for agentic tasks.

Use Cases

This model is particularly well-suited for applications requiring:

  • Multi-turn function calling: Executing a sequence of tool interactions to complete a complex task.
  • Agentic workflows: Building AI agents that can reason and act over multiple steps.
  • Automated task completion: Handling structured interactions in domains like retail and airline services, though telecom tasks remain more challenging.

Limitations

Users should note challenges in the Telecom domain (40% Pass@4) and sensitivity to the user simulator used during evaluation. The reported Pass@k metric differs from the Pass^k used on the official tau2-bench leaderboard.