distillabs/tft-benchmark-s5-direct-Qwen3-1.7B

Text Generation · Concurrency Cost: 1 · Model Size: 2B · Quant: BF16 · Ctx Length: 32k · Published: Apr 15, 2026 · License: apache-2.0 · Architecture: Transformer · Open Weights · Cold

distillabs/tft-benchmark-s5-direct-Qwen3-1.7B is a 1.7 billion parameter Qwen3 model developed by Distil Labs and fine-tuned for multi-turn tool calling. It was trained with a 'Direct Training' pipeline on mixed production traces for the TFT benchmark, targeting restaurant search and reservation tasks. The model serves as a baseline for measuring how well direct training on raw, potentially corrupted production data holds up against more refined training approaches.


Overview

This model, tft-benchmark-s5-direct-Qwen3-1.7B, is a 1.7 billion parameter Qwen3 model developed by Distil Labs. It is specifically fine-tuned for multi-turn tool calling within the context of the TFT (Training from Traces) Benchmark. The model was trained using a 'Direct Training' pipeline, meaning it was fine-tuned directly on raw production traces without filtering, relabeling, or synthetic data generation.
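To make "multi-turn tool calling on production traces" concrete, a trace of the kind this model was fine-tuned on might look like the following. The message layout, field names, and values here are illustrative assumptions, not actual benchmark data:

```python
# Hypothetical multi-turn tool-calling trace (layout and values are
# illustrative; the benchmark's actual trace format may differ).
trace = [
    {"role": "user", "content": "Any Italian places in Oakland with live music?"},
    {"role": "assistant", "tool_call": {
        "name": "FindRestaurants",
        "arguments": {"cuisine": "Italian", "city": "Oakland", "has_live_music": True},
    }},
    {"role": "tool", "content": '[{"restaurant_name": "Bella Notte", "price_range": "moderate"}]'},
    {"role": "assistant", "tool_call": {
        "name": "respond_to_user",
        "arguments": {"message": "Bella Notte in Oakland has live music. Want me to book it?"},
    }},
    {"role": "user", "content": "Yes, for 2 people tomorrow at 7pm."},
    {"role": "assistant", "tool_call": {
        "name": "ReserveRestaurant",
        "arguments": {"restaurant_name": "Bella Notte", "city": "Oakland",
                      "time": "19:00", "date": "2026-04-16", "party_size": 2},
    }},
]
```

Under direct training, traces like this are used for fine-tuning as-is, with no filtering or relabeling step in between.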

Key Characteristics & Performance

  • Base Model: Qwen3-1.7B.
  • Training Method: LoRA fine-tuning on raw production traces; the adapter weights are merged into the base model.
  • Scenario: S5 Trace Mixing, involving 80% Hotels_1 and 20% Restaurants_1 traces, with shuffled message order and renamed function names.
  • Evaluation: Achieved an LLM-as-a-judge score of 0.694 and a staged_tool_call score of 0.74 on a held-out test set of multi-turn Restaurants_1 conversations.
  • Target Tools: Designed to handle respond_to_user, FindRestaurants (by cuisine, city, price, music, alcohol), and ReserveRestaurant (by name, city, time, date, party size) based on the Schema-Guided Dialogue (SGD) dataset.
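The three target tools above can be sketched as function-calling schemas. The parameter names and JSON layout are assumptions for illustration (OpenAI-style function schemas); the benchmark's exact schema format, drawn from SGD, may differ:

```python
# Illustrative function-calling schemas for the model's target tools.
# Parameters follow the bullet list above; exact names are assumptions.
RESPOND_TO_USER = {
    "name": "respond_to_user",
    "parameters": {
        "type": "object",
        "properties": {"message": {"type": "string"}},
        "required": ["message"],
    },
}
FIND_RESTAURANTS = {
    "name": "FindRestaurants",
    "parameters": {
        "type": "object",
        "properties": {
            "cuisine": {"type": "string"},
            "city": {"type": "string"},
            "price_range": {"type": "string"},
            "has_live_music": {"type": "boolean"},
            "serves_alcohol": {"type": "boolean"},
        },
        "required": ["cuisine", "city"],
    },
}
RESERVE_RESTAURANT = {
    "name": "ReserveRestaurant",
    "parameters": {
        "type": "object",
        "properties": {
            "restaurant_name": {"type": "string"},
            "city": {"type": "string"},
            "time": {"type": "string"},
            "date": {"type": "string"},
            "party_size": {"type": "integer"},
        },
        "required": ["restaurant_name", "city", "time"],
    },
}
TOOLS = [RESPOND_TO_USER, FIND_RESTAURANTS, RESERVE_RESTAURANT]
```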

Benchmark Context

This model is part of a larger benchmark comparing 'Direct Training' against a 'TFT Pipeline' (trace filtering, relabeling, synthetic data generation, fine-tuning). While 'Direct Training' performs similarly to TFT on clean data, the TFT pipeline significantly outperforms it on corrupted scenarios, by +16.4 percentage points in the S5 Trace Mixing scenario. This model thus illustrates the cost of training directly on noisy production data for tool-calling tasks.
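The S5 corruption described above can be sketched as a small data-preparation step: mix domain traces roughly 80/20, shuffle message order within each trace, and rename function calls. The helper name, trace layout, and rename map below are assumptions for illustration, not the benchmark's actual tooling:

```python
import random

# Illustrative rename map for the "renamed function names" corruption.
RENAMES = {"SearchHotel": "FindHotels"}

def mix_traces(hotel_traces, restaurant_traces, seed=0):
    """Build an S5-style corrupted training set (illustrative sketch):
    ~80% Hotels_1 / ~20% Restaurants_1 traces, shuffled message order,
    renamed function calls."""
    rng = random.Random(seed)
    # Pick restaurant traces so they make up ~20% of the final mix.
    n_rest = max(1, round(len(hotel_traces) / 4))
    mixed = list(hotel_traces) + rng.sample(restaurant_traces, n_rest)
    for t in mixed:
        rng.shuffle(t["messages"])               # corrupt message order
        for msg in t["messages"]:
            if msg.get("function") in RENAMES:   # rename function calls
                msg["function"] = RENAMES[msg["function"]]
    rng.shuffle(mixed)
    return mixed
```

Training directly on the output of a step like this, with no filtering or relabeling, is exactly what the 'Direct Training' baseline measures.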