distillabs/tft-benchmark-s5-tft-Qwen3-1.7B

Text generation · Concurrency cost: 1 · Model size: 2B · Quantization: BF16 · Context length: 32k · Published: Apr 15, 2026 · License: apache-2.0 · Architecture: Transformer · Open weights · Cold

distillabs/tft-benchmark-s5-tft-Qwen3-1.7B is a 1.7-billion-parameter Qwen3 model fine-tuned by Distil Labs for multi-turn tool calling. It was developed as part of the TFT (Training from Traces) Benchmark, which evaluates training methods for Small Language Models (SLMs) on production traces. The model performs well on mixed and partially corrupted trace data, showing robust tool-calling behavior, particularly for restaurant search and reservation functions.


Overview

This model, tft-benchmark-s5-tft-Qwen3-1.7B, is a 1.7-billion-parameter Qwen3 variant developed by Distil Labs and fine-tuned for multi-turn tool calling within the TFT (Training from Traces) Benchmark. Its primary purpose is to demonstrate that the TFT pipeline, which combines trace filtering, committee relabeling, and synthetic data generation, outperforms direct training on raw production traces.

Key Capabilities & Performance

  • Multi-turn Tool Calling: Optimized for complex conversational flows requiring multiple tool interactions.
  • Robustness to Data Corruption: Achieves an LLM-as-a-judge score of 0.858 and a staged_tool_call score of 0.741 in a scenario with mixed and shuffled trace data (S5 Trace Mixing).
  • Superiority on Corrupted Data: Outperforms direct training on raw traces by 12–26 percentage points across corrupted-data scenarios (e.g., noisy labels, schema drift, low-data regimes).
  • Targeted Tool Use: Proficient in using tools for restaurant search (FindRestaurants) and reservation (ReserveRestaurant), as well as general user responses (respond_to_user), based on the Schema-Guided Dialogue (SGD) dataset.
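To make the tool-use target concrete, the sketch below defines illustrative schemas for the three tools named above and a small validator for model-emitted calls. The parameter names are assumptions for illustration; the model card does not list the actual SGD slot names.

```python
# Illustrative tool schemas in the style of the SGD Restaurants service.
# Parameter names are assumptions; the card does not document them.
TOOLS = {
    "FindRestaurants": {
        "required": ["city"],
        "optional": ["cuisine", "price_range"],
    },
    "ReserveRestaurant": {
        "required": ["restaurant_name", "city", "time", "party_size"],
        "optional": ["date"],
    },
    "respond_to_user": {
        "required": ["message"],
        "optional": [],
    },
}

def validate_tool_call(call: dict) -> list[str]:
    """Return a list of problems with a model-emitted tool call (empty = valid)."""
    errors = []
    name = call.get("name")
    schema = TOOLS.get(name)
    if schema is None:
        return [f"unknown tool: {name!r}"]
    args = call.get("arguments", {})
    # Check that every required parameter is present.
    for param in schema["required"]:
        if param not in args:
            errors.append(f"{name}: missing required argument {param!r}")
    # Reject parameters outside the schema (a common failure mode on noisy traces).
    allowed = set(schema["required"]) | set(schema["optional"])
    for param in args:
        if param not in allowed:
            errors.append(f"{name}: unexpected argument {param!r}")
    return errors
```

A harness like this is useful for computing staged tool-call metrics, since it separates "called the wrong tool" from "called the right tool with malformed arguments".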

Training Methodology

The model was trained using the TFT pipeline, which involves:

  • Filtering production traces.
  • Committee-relabeling by multiple LLMs (openai.gpt-oss-120b + zai.glm-5).
  • Generating synthetic data from these relabeled traces.
  • LoRA fine-tuning the Qwen3-1.7B base model on the resulting synthetic dataset.

Use Cases

This model is particularly suitable for applications requiring reliable multi-turn tool calling, especially when training data might be derived from real-world, potentially noisy or mixed production traces. Its demonstrated robustness makes it a strong candidate for building conversational agents that interact with structured APIs for tasks like booking and searching.
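A minimal sketch of the multi-turn loop such an agent runs, with a stand-in for the model and a plain-dict message format (both are assumptions; the actual chat template and stop condition depend on the serving stack):

```python
import json

def run_agent(model_step, user_message: str, tools: dict, max_turns: int = 5) -> str:
    """Minimal multi-turn tool-calling loop.

    `model_step` stands in for the fine-tuned model: given the message history,
    it returns a JSON string that is either a tool call
    ({"name": ..., "arguments": ...}) or a final reply via respond_to_user.
    """
    history = [{"role": "user", "content": user_message}]
    for _ in range(max_turns):
        call = json.loads(model_step(history))
        if call["name"] == "respond_to_user":
            return call["arguments"]["message"]
        # Execute the requested tool and feed its result back into the history.
        result = tools[call["name"]](**call["arguments"])
        history.append(
            {"role": "tool", "name": call["name"], "content": json.dumps(result)}
        )
    return "Sorry, I couldn't complete the request."

# Usage with a stub model that searches, then answers.
def stub_model(history):
    if history[-1]["role"] == "user":
        return json.dumps({"name": "FindRestaurants", "arguments": {"city": "Berlin"}})
    return json.dumps({"name": "respond_to_user", "arguments": {"message": "Found 2 places."}})

tools = {"FindRestaurants": lambda city: [{"name": "A"}, {"name": "B"}]}
print(run_agent(stub_model, "Find food in Berlin", tools))  # → Found 2 places.
```

Swapping `stub_model` for calls to the deployed model is the only change needed to turn this harness into a real booking-and-search agent.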