distillabs/tft-benchmark-s5-tft-Qwen3-1.7B

Text generation · Concurrency cost: 1 · Model size: 2B · Quantization: BF16 · Context length: 32k · Published: Apr 15, 2026 · License: apache-2.0 · Architecture: Transformer · Open weights · Cold

distillabs/tft-benchmark-s5-tft-Qwen3-1.7B is a 1.7-billion-parameter Qwen3 model fine-tuned by Distil Labs for multi-turn tool calling. It was developed as part of the TFT (Training from Traces) Benchmark, which evaluates training methods for Small Language Models (SLMs) on production traces. The model performs well on mixed and partially corrupted trace data, showing robust tool-calling behavior, particularly for restaurant search and reservation functions.


Overview

This model, tft-benchmark-s5-tft-Qwen3-1.7B, is a 1.7-billion-parameter Qwen3 variant developed by Distil Labs and fine-tuned for multi-turn tool calling within the TFT (Training from Traces) Benchmark. Its primary purpose is to demonstrate that the TFT pipeline, which combines trace filtering, committee relabeling, and synthetic data generation, outperforms direct training on raw production traces.

Key Capabilities & Performance

  • Multi-turn Tool Calling: Optimized for complex conversational flows requiring multiple tool interactions.
  • Robustness to Data Corruption: Achieves an LLM-as-a-judge score of 0.858 and a staged_tool_call score of 0.741 in a scenario with mixed and shuffled trace data (S5 Trace Mixing).
  • Superiority on Corrupted Data: Outperforms direct training on raw traces by 12–26 percentage points across corrupted-data scenarios (e.g., noisy labels, schema drift, low-data regimes).
  • Targeted Tool Use: Proficient in using tools for restaurant search (FindRestaurants) and reservation (ReserveRestaurant), as well as general user responses (respond_to_user), based on the Schema-Guided Dialogue (SGD) dataset.
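To make the tool-use target concrete, the sketch below defines illustrative schemas for the three tools named above and a small validator for model-emitted calls. The parameter names are assumptions for illustration; the model card does not list the actual SGD slot names.

```python
# Illustrative tool schemas in the style of the SGD Restaurants service.
# Parameter names are assumptions; the card does not document them.
TOOLS = {
    "FindRestaurants": {
        "required": ["city"],
        "optional": ["cuisine", "price_range"],
    },
    "ReserveRestaurant": {
        "required": ["restaurant_name", "city", "time", "party_size"],
        "optional": ["date"],
    },
    "respond_to_user": {
        "required": ["message"],
        "optional": [],
    },
}

def validate_tool_call(call: dict) -> list[str]:
    """Return a list of problems with a model-emitted tool call (empty = valid)."""
    errors = []
    name = call.get("name")
    schema = TOOLS.get(name)
    if schema is None:
        return [f"unknown tool: {name!r}"]
    args = call.get("arguments", {})
    # Check that every required parameter is present.
    for param in schema["required"]:
        if param not in args:
            errors.append(f"{name}: missing required argument {param!r}")
    # Reject parameters outside the schema (a common failure mode on noisy traces).
    allowed = set(schema["required"]) | set(schema["optional"])
    for param in args:
        if param not in allowed:
            errors.append(f"{name}: unexpected argument {param!r}")
    return errors
```

A harness like this is useful for computing staged tool-call metrics, since it separates "called the wrong tool" from "called the right tool with malformed arguments".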

Training Methodology

The model was trained using the TFT pipeline, which involves:

  • Filtering production traces.
  • Committee-relabeling by multiple LLMs (openai.gpt-oss-120b + zai.glm-5).
  • Generating synthetic data from these relabeled traces.
  • LoRA fine-tuning the Qwen3-1.7B base model on the resulting synthetic dataset.

Use Cases

This model is particularly suitable for applications requiring reliable multi-turn tool calling, especially when training data might be derived from real-world, potentially noisy or mixed production traces. Its demonstrated robustness makes it a strong candidate for building conversational agents that interact with structured APIs for tasks like booking and searching.
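A minimal sketch of the multi-turn loop such an agent runs, with a stand-in for the model and a plain-dict message format (both are assumptions; the actual chat template and stop condition depend on the serving stack):

```python
import json

def run_agent(model_step, user_message: str, tools: dict, max_turns: int = 5) -> str:
    """Minimal multi-turn tool-calling loop.

    `model_step` stands in for the fine-tuned model: given the message history,
    it returns a JSON string that is either a tool call
    ({"name": ..., "arguments": ...}) or a final reply via respond_to_user.
    """
    history = [{"role": "user", "content": user_message}]
    for _ in range(max_turns):
        call = json.loads(model_step(history))
        if call["name"] == "respond_to_user":
            return call["arguments"]["message"]
        # Execute the requested tool and feed its result back into the history.
        result = tools[call["name"]](**call["arguments"])
        history.append(
            {"role": "tool", "name": call["name"], "content": json.dumps(result)}
        )
    return "Sorry, I couldn't complete the request."

# Usage with a stub model that searches, then answers.
def stub_model(history):
    if history[-1]["role"] == "user":
        return json.dumps({"name": "FindRestaurants", "arguments": {"city": "Berlin"}})
    return json.dumps({"name": "respond_to_user", "arguments": {"message": "Found 2 places."}})

tools = {"FindRestaurants": lambda city: [{"name": "A"}, {"name": "B"}]}
print(run_agent(stub_model, "Find food in Berlin", tools))  # → Found 2 places.
```

Swapping `stub_model` for calls to the deployed model is the only change needed to turn this harness into a real booking-and-search agent.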