distillabs/tft-benchmark-s3-tft-Qwen3-1.7B
distillabs/tft-benchmark-s3-tft-Qwen3-1.7B is a 1.7-billion-parameter Qwen3 model fine-tuned by Distil Labs for multi-turn tool calling, specifically optimized for scenarios involving schema drift in production traces. Under schema-drift conditions it achieves an LLM-as-a-judge score of 0.844 and a staged_tool_call score of 0.748.
Overview
This model, tft-benchmark-s3-tft-Qwen3-1.7B, is a 1.7 billion parameter Qwen3 variant developed by Distil Labs. It has been fine-tuned for multi-turn tool calling, specifically addressing challenges presented by schema drift in production traces. This model is a key component of the TFT (Training from Traces) Benchmark, which evaluates different approaches to training Small Language Models (SLMs) from real-world data.
Key Capabilities & Training
- Multi-turn Tool Calling: Excels at complex interactions requiring sequential tool use, such as restaurant search and reservation (FindRestaurants, ReserveRestaurant, respond_to_user).
- Robustness to Schema Drift: Specifically trained and evaluated on a scenario where function and parameter names are randomly renamed, simulating real-world schema changes. It significantly outperforms direct training methods in these conditions.
- TFT Pipeline: Uses a training pipeline involving trace filtering, committee relabeling by multiple LLMs (e.g., openai.gpt-oss-120b, zai.glm-5), and synthetic data generation to enhance learning from corrupted or noisy production traces.
- LoRA Fine-tuning: Fine-tuned using LoRA, with merged weights, on a synthetic dataset derived from the Schema-Guided Dialogue (SGD) dataset.
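To make the schema-drift idea concrete, here is a minimal sketch of the kind of corruption described above: function and parameter names in a tool schema are randomly renamed, forcing a model to rely on descriptions rather than memorized names. The `drift_schema` helper, the tool schema, and the renaming scheme are illustrative assumptions, not the benchmark's actual implementation.

```python
import random

def drift_schema(tools, seed=0):
    """Return a copy of `tools` with function and parameter names randomly
    renamed, plus a mapping from original names to drifted names."""
    rng = random.Random(seed)
    drifted, renames = [], {}
    for tool in tools:
        new_fn = f"fn_{rng.randrange(10**6):06d}"
        renames[tool["name"]] = new_fn
        new_params = {}
        for pname, pspec in tool["parameters"].items():
            new_p = f"arg_{rng.randrange(10**6):06d}"
            renames[f"{tool['name']}.{pname}"] = new_p
            new_params[new_p] = pspec  # descriptions/types survive the drift
        drifted.append({"name": new_fn,
                        "description": tool["description"],
                        "parameters": new_params})
    return drifted, renames

# Illustrative schema in the spirit of the SGD restaurant domain.
tools = [{"name": "FindRestaurants",
          "description": "Search for restaurants by city and cuisine.",
          "parameters": {"city": {"type": "string"},
                         "cuisine": {"type": "string"}}}]
drifted, renames = drift_schema(tools)
```

Only the names change; descriptions and types are preserved, which is what makes recovery possible for a model trained to read the schema rather than pattern-match on names.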
Performance Highlights
- Achieved an LLM-as-a-judge score of 0.844 and a staged_tool_call score of 0.748 in the S3 Schema Drift scenario.
- In benchmark comparisons, the TFT pipeline, which this model uses, consistently outperforms direct training by 12-26 percentage points across various corrupted data scenarios, while matching performance on clean data.
Use Cases
This model is particularly well-suited for applications requiring reliable multi-turn tool calling in environments where underlying API schemas may evolve or be inconsistent, making it valuable for robust conversational AI and agentic systems.
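As a rough sketch of how such a model slots into an agentic loop, the snippet below runs a multi-turn tool-calling dialogue with the model call stubbed out. In a real deployment `call_model` would generate tool calls with the fine-tuned model (e.g. via transformers); here it returns canned calls so the dispatch logic is runnable. All names (`TOOLS`, `call_model`, `run_dialogue`) and the canned responses are hypothetical.

```python
import json

# Hypothetical tool implementations keyed by the names the model emits.
TOOLS = {
    "FindRestaurants": lambda city, cuisine: [{"name": "Trattoria Roma"}],
    "ReserveRestaurant": lambda name, time: {"status": "confirmed"},
}

def call_model(messages):
    # Stub standing in for the fine-tuned model: picks the next tool call
    # based on how many tool results have come back so far.
    turn = sum(1 for m in messages if m["role"] == "tool")
    if turn == 0:
        return {"tool": "FindRestaurants",
                "args": {"city": "Rome", "cuisine": "Italian"}}
    if turn == 1:
        return {"tool": "ReserveRestaurant",
                "args": {"name": "Trattoria Roma", "time": "19:00"}}
    return {"tool": "respond_to_user",
            "args": {"text": "Your table is booked for 19:00."}}

def run_dialogue(user_msg, max_turns=5):
    messages = [{"role": "user", "content": user_msg}]
    for _ in range(max_turns):
        call = call_model(messages)
        if call["tool"] == "respond_to_user":
            return call["args"]["text"]
        result = TOOLS[call["tool"]](**call["args"])  # execute the tool
        messages.append({"role": "tool", "content": json.dumps(result)})
    return None
```

The `respond_to_user` terminal tool mirrors the benchmark's pattern above, where the model ends a multi-turn exchange by answering the user rather than calling another function.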