distillabs/tft-benchmark-s4-tft-Qwen3-1.7B

Text generation · Concurrency cost: 1 · Model size: 2B · Quantization: BF16 · Context length: 32k · Published: Apr 15, 2026 · License: apache-2.0 · Architecture: Transformer · Open weights · Cold

The distillabs/tft-benchmark-s4-tft-Qwen3-1.7B is a 1.7 billion parameter Qwen3 model fine-tuned by Distil Labs for multi-turn tool calling. It was developed as part of the TFT Benchmark to evaluate training methods for Small Language Models (SLMs) from production traces. This model excels in low-data scenarios, achieving a 0.852 LLM-as-a-judge score and 0.74 staged_tool_call score, demonstrating robust performance in tool-use tasks even with extreme data scarcity.


Overview

This model, tft-benchmark-s4-tft-Qwen3-1.7B, is a 1.7 billion parameter Qwen3 variant developed by Distil Labs. It is specifically fine-tuned for multi-turn tool calling within the context of the TFT (Training from Traces) Benchmark. The benchmark evaluates two distinct approaches for training SLMs from production traces: the TFT Pipeline and Direct Training.

Key Capabilities & Performance

  • Multi-turn Tool Calling: Optimized for complex conversational interactions requiring tool use, such as restaurant search and reservation based on the Schema-Guided Dialogue (SGD) dataset.
  • Robustness in Low-Data Scenarios: This specific model was trained under the 'S4 Low Data' scenario, utilizing only 5 clean traces. Despite extreme data scarcity, it achieved an LLM-as-a-judge score of 0.852 and a staged_tool_call score of 0.74.
  • TFT Pipeline Advantage: The TFT pipeline, which combines trace filtering, committee relabeling, and synthetic data generation, significantly outperforms Direct Training when traces are corrupted or scarce, with a +20.3 percentage point gain in the S4 Low Data scenario.

Training Methodology

The model was trained using the TFT pipeline, where production traces are filtered, relabeled by a committee of LLMs (openai.gpt-oss-120b + zai.glm-5), and then used to seed synthetic data generation. The student model is subsequently fine-tuned on this synthetic dataset using LoRA. The teacher/synthetic generation model used was zai.glm-5, and the judge model was openai.gpt-oss-120b.
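The filter-and-relabel stage described above can be sketched roughly as follows. This is a minimal illustration with hypothetical helper names (`is_clean`, `committee_relabel`); the actual Distil Labs pipeline implementation is not public, and the committee models (openai.gpt-oss-120b and zai.glm-5) are represented here by plain callables.

```python
from collections import Counter

def committee_relabel(trace, committee):
    """Ask each committee member for a label and keep the majority vote.

    `committee` is a list of callables standing in for the LLM relabelers
    (in the TFT pipeline, openai.gpt-oss-120b + zai.glm-5).
    """
    votes = [model(trace) for model in committee]
    label, _count = Counter(votes).most_common(1)[0]
    return label

def tft_filter_and_relabel(traces, committee, is_clean):
    """Drop traces failing the filter, then relabel the survivors.

    `is_clean` is a hypothetical predicate (e.g. "all tool calls are
    schema-valid"). The relabeled traces would then seed synthetic
    data generation before LoRA fine-tuning of the student model.
    """
    kept = [t for t in traces if is_clean(t)]
    return [(t, committee_relabel(t, committee)) for t in kept]
```

The majority vote is the simplest committee-aggregation choice; a production pipeline could equally use unanimity or a judge model to break ties.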

Target Tools

The model targets three tools: FindRestaurants for restaurant search, ReserveRestaurant for making reservations, and a general respond_to_user function.
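Declared in OpenAI-style function-calling JSON, the three tools might look like the sketch below. The parameter names are illustrative guesses only; the canonical slots come from the Schema-Guided Dialogue (SGD) dataset and are not reproduced in this card.

```python
# Hypothetical tool schemas; parameter names are assumptions, not the
# model's actual SGD-derived schema.
TOOLS = [
    {
        "type": "function",
        "function": {
            "name": "FindRestaurants",
            "description": "Search for restaurants matching the user's criteria.",
            "parameters": {
                "type": "object",
                "properties": {
                    "city": {"type": "string"},
                    "cuisine": {"type": "string"},
                },
                "required": ["city"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "ReserveRestaurant",
            "description": "Reserve a table at a specific restaurant.",
            "parameters": {
                "type": "object",
                "properties": {
                    "restaurant_name": {"type": "string"},
                    "time": {"type": "string"},
                    "party_size": {"type": "integer"},
                },
                "required": ["restaurant_name", "time"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "respond_to_user",
            "description": "Send a plain-text reply to the user.",
            "parameters": {
                "type": "object",
                "properties": {"message": {"type": "string"}},
                "required": ["message"],
            },
        },
    },
]
```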

When to Use This Model

This model is particularly well-suited for applications requiring reliable multi-turn tool calling, especially when working with limited or noisy production trace data for fine-tuning. Its strong performance in low-data conditions makes it valuable for developing robust conversational AI agents where data collection is challenging.
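A multi-turn tool-calling exchange of the kind this model is tuned for alternates user turns, assistant tool calls, and tool results. The hypothetical transcript below shows the common OpenAI-style message shape; the exact chat format the model expects is defined by its Qwen3 chat template, not reproduced here.

```python
import json

# Illustrative multi-turn transcript; restaurant data is made up.
messages = [
    {"role": "user",
     "content": "Book me an Italian place in San Jose tonight."},
    # The assistant first calls a search tool rather than answering directly.
    {"role": "assistant",
     "tool_calls": [{
         "id": "call_1",
         "type": "function",
         "function": {
             "name": "FindRestaurants",
             "arguments": json.dumps({"city": "San Jose",
                                      "cuisine": "Italian"}),
         },
     }]},
    # The tool result is fed back, keyed to the call id.
    {"role": "tool",
     "tool_call_id": "call_1",
     "content": json.dumps([{"name": "Vesuvio", "rating": 4.6}])},
    # The assistant then responds in natural language.
    {"role": "assistant",
     "content": "Vesuvio looks good. Shall I reserve it?"},
]
```

In a real deployment these messages, together with the tool schemas, would be rendered through the model's chat template and the assistant turns generated by the model itself.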