distillabs/tft-benchmark-s3-tft-Qwen3-1.7B
distillabs/tft-benchmark-s3-tft-Qwen3-1.7B is a 1.7-billion-parameter Qwen3 model fine-tuned by Distil Labs for multi-turn tool calling, specifically optimized for scenarios involving schema drift in production traces. Under schema-drift conditions it achieves an LLM-as-a-judge score of 0.844 and a staged_tool_call score of 0.748.
Overview
This model, tft-benchmark-s3-tft-Qwen3-1.7B, is a 1.7 billion parameter Qwen3 variant developed by Distil Labs. It has been fine-tuned for multi-turn tool calling, specifically addressing challenges presented by schema drift in production traces. This model is a key component of the TFT (Training from Traces) Benchmark, which evaluates different approaches to training Small Language Models (SLMs) from real-world data.
Key Capabilities & Training
- Multi-turn Tool Calling: Excels at complex interactions requiring sequential tool use, such as restaurant search and reservation (FindRestaurants, ReserveRestaurant, respond_to_user).
- Robustness to Schema Drift: Specifically trained and evaluated on a scenario where function and parameter names are randomly renamed, simulating real-world schema changes. It significantly outperforms direct training methods in these conditions.
- TFT Pipeline: Uses a training pipeline involving trace filtering, committee relabeling by multiple LLMs (e.g., openai.gpt-oss-120b, zai.glm-5), and synthetic data generation to enhance learning from corrupted or noisy production traces.
- LoRA Fine-tuning: Fine-tuned using LoRA, with merged weights, on a synthetic dataset derived from the Schema-Guided Dialogue (SGD) dataset.
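To make the schema-drift idea concrete, here is a minimal sketch of the kind of corruption described above: function and parameter names in a tool schema are randomly renamed, forcing a model to rely on descriptions rather than memorized names. The `drift_schema` helper, the tool schema, and the renaming scheme are illustrative assumptions, not the benchmark's actual implementation.

```python
import random

def drift_schema(tools, seed=0):
    """Return a copy of `tools` with function and parameter names randomly
    renamed, plus a mapping from original names to drifted names."""
    rng = random.Random(seed)
    drifted, renames = [], {}
    for tool in tools:
        new_fn = f"fn_{rng.randrange(10**6):06d}"
        renames[tool["name"]] = new_fn
        new_params = {}
        for pname, pspec in tool["parameters"].items():
            new_p = f"arg_{rng.randrange(10**6):06d}"
            renames[f"{tool['name']}.{pname}"] = new_p
            new_params[new_p] = pspec  # descriptions/types survive the drift
        drifted.append({"name": new_fn,
                        "description": tool["description"],
                        "parameters": new_params})
    return drifted, renames

# Illustrative schema in the spirit of the SGD restaurant domain.
tools = [{"name": "FindRestaurants",
          "description": "Search for restaurants by city and cuisine.",
          "parameters": {"city": {"type": "string"},
                         "cuisine": {"type": "string"}}}]
drifted, renames = drift_schema(tools)
```

Only the names change; descriptions and types are preserved, which is what makes recovery possible for a model trained to read the schema rather than pattern-match on names.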
Performance Highlights
- Achieved an LLM-as-a-judge score of 0.844 and a staged_tool_call score of 0.748 in the S3 Schema Drift scenario.
- In benchmark comparisons, the TFT pipeline, which this model uses, consistently outperforms direct training by 12-26 percentage points across various corrupted data scenarios, while matching performance on clean data.
Use Cases
This model is particularly well-suited for applications requiring reliable multi-turn tool calling in environments where underlying API schemas may evolve or be inconsistent, making it valuable for robust conversational AI and agentic systems.
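As a rough sketch of how such a model slots into an agentic loop, the snippet below runs a multi-turn tool-calling dialogue with the model call stubbed out. In a real deployment `call_model` would generate tool calls with the fine-tuned model (e.g. via transformers); here it returns canned calls so the dispatch logic is runnable. All names (`TOOLS`, `call_model`, `run_dialogue`) and the canned responses are hypothetical.

```python
import json

# Hypothetical tool implementations keyed by the names the model emits.
TOOLS = {
    "FindRestaurants": lambda city, cuisine: [{"name": "Trattoria Roma"}],
    "ReserveRestaurant": lambda name, time: {"status": "confirmed"},
}

def call_model(messages):
    # Stub standing in for the fine-tuned model: picks the next tool call
    # based on how many tool results have come back so far.
    turn = sum(1 for m in messages if m["role"] == "tool")
    if turn == 0:
        return {"tool": "FindRestaurants",
                "args": {"city": "Rome", "cuisine": "Italian"}}
    if turn == 1:
        return {"tool": "ReserveRestaurant",
                "args": {"name": "Trattoria Roma", "time": "19:00"}}
    return {"tool": "respond_to_user",
            "args": {"text": "Your table is booked for 19:00."}}

def run_dialogue(user_msg, max_turns=5):
    messages = [{"role": "user", "content": user_msg}]
    for _ in range(max_turns):
        call = call_model(messages)
        if call["tool"] == "respond_to_user":
            return call["args"]["text"]
        result = TOOLS[call["tool"]](**call["args"])  # execute the tool
        messages.append({"role": "tool", "content": json.dumps(result)})
    return None
```

The `respond_to_user` terminal tool mirrors the benchmark's pattern above, where the model ends a multi-turn exchange by answering the user rather than calling another function.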