Jarrodbarnes/Qwen3-4B-tau2-sft1
Text Generation · Concurrency Cost: 1 · Model Size: 4B · Quant: BF16 · Ctx Length: 32k · Published: Jan 16, 2026 · License: other · Architecture: Transformer

Jarrodbarnes/Qwen3-4B-tau2-sft1 is a 4 billion parameter supervised fine-tuned (SFT) model based on Qwen/Qwen3-4B-Instruct-2507, specifically optimized for tool-use tasks. It was trained using the Slime tau2 training cookbook on rejection-sampled trajectories from the Jarrodbarnes/tau2-sft-seed-v3 dataset. This model is designed for research and reproduction of tau2-bench tool-use training, demonstrating a 0.40 pass@1 score on the tau2-bench test split across airline, retail, and telecom domains.


Jarrodbarnes/Qwen3-4B-tau2-sft1: Tool-Use Fine-Tuned Model

This model is a 4 billion parameter supervised fine-tuned (SFT) checkpoint, built upon the Qwen/Qwen3-4B-Instruct-2507 base model. Its primary focus is on tool-use tasks, specifically within the context of the tau2-bench framework.
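Tool-use prompting with Qwen-style instruct models is typically driven by a list of JSON tool schemas alongside the chat messages. The sketch below shows that message/tool structure; the tool name and schema (`get_flight_status`) are illustrative placeholders, not part of this model card.

```python
# Hypothetical tool schema and conversation in the format commonly used
# with Qwen-style instruct models. The tool itself is an invented example.
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_flight_status",
            "description": "Look up the status of a flight by number.",
            "parameters": {
                "type": "object",
                "properties": {
                    "flight_number": {"type": "string"},
                },
                "required": ["flight_number"],
            },
        },
    }
]

messages = [
    {"role": "system", "content": "You are a helpful airline support agent."},
    {"role": "user", "content": "Is flight QF12 on time?"},
]

# With the transformers library installed, the tokenizer's chat template
# would render this into the model's prompt, e.g.:
#   tokenizer.apply_chat_template(messages, tools=tools,
#                                 add_generation_prompt=True, tokenize=False)
```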

Key Characteristics & Training

  • Base Model: Qwen/Qwen3-4B-Instruct-2507.
  • Fine-tuning: Supervised fine-tuning (SFT) using the Slime tau2 training cookbook.
  • Training Data: Utilizes the Jarrodbarnes/tau2-sft-seed-v3 dataset, which consists of filtered, rejection-sampled trajectories.
  • Hyperparameters: Key settings include num_epoch=2, global_batch_size=16, and a learning rate of 1e-5 with cosine decay.
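The reported hyperparameters can be collected into a single config for reproduction attempts. The values below come from the card itself; the field names are illustrative, and the Slime cookbook's actual config keys may differ.

```python
# SFT hyperparameters from the model card, expressed as a plain config
# dict. Key names are an assumption, not the Slime cookbook's schema.
sft_config = {
    "base_model": "Qwen/Qwen3-4B-Instruct-2507",
    "dataset": "Jarrodbarnes/tau2-sft-seed-v3",
    "num_epoch": 2,
    "global_batch_size": 16,
    "learning_rate": 1e-5,
    "lr_schedule": "cosine",  # cosine decay, per the card
}
```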

Performance on tau2-bench

The model was evaluated on the tau2-bench test split (100 tasks) using the pass@1 metric (any-success over 1 attempt):

  • Overall pass@1: 0.40
  • Domain-specific pass@1:
    • Airline: 0.20 (20 tasks)
    • Retail: 0.60 (40 tasks)
    • Telecom: 0.30 (40 tasks)
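The overall score is consistent with the per-domain numbers: it is the task-count-weighted mean of the domain pass@1 values, as this quick check shows.

```python
# Sanity check: overall pass@1 equals the task-weighted mean of the
# per-domain pass@1 scores reported above.
domains = {
    "airline": (0.20, 20),   # (pass@1, number of tasks)
    "retail": (0.60, 40),
    "telecom": (0.30, 40),
}

total_tasks = sum(n for _, n in domains.values())
overall = sum(score * n for score, n in domains.values()) / total_tasks
print(round(overall, 2))  # 0.4
```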

Intended Use

This model is specifically intended for research and reproduction of tau2-bench tool-use training. It is not recommended for deployment without further safety evaluation. The evaluation results are subject to small variance due to the user simulator's stochastic nature.