distillabs/tft-benchmark-s3-tft-Qwen3-1.7B

Text generation · Concurrency cost: 1 · Model size: 2B · Quant: BF16 · Context length: 32k · Published: Apr 15, 2026 · License: apache-2.0 · Architecture: Transformer · Open weights · Cold

distillabs/tft-benchmark-s3-tft-Qwen3-1.7B is a 1.7 billion parameter Qwen3 model fine-tuned for multi-turn tool calling. Developed by Distil Labs, it is specifically optimized for scenarios involving schema drift in production traces. The model demonstrates robust performance on complex tool-use tasks, achieving an LLM-as-a-judge score of 0.844 and a staged_tool_call score of 0.748 under schema-drift conditions.


Overview

This model, tft-benchmark-s3-tft-Qwen3-1.7B, is a 1.7 billion parameter Qwen3 variant developed by Distil Labs. It has been fine-tuned for multi-turn tool calling, specifically addressing challenges presented by schema drift in production traces. This model is a key component of the TFT (Training from Traces) Benchmark, which evaluates different approaches to training Small Language Models (SLMs) from real-world data.

Key Capabilities & Training

  • Multi-turn Tool Calling: Excels at complex interactions requiring sequential tool use, such as restaurant search and reservation (FindRestaurants, ReserveRestaurant, respond_to_user).
  • Robustness to Schema Drift: Specifically trained and evaluated on a scenario where function and parameter names are randomly renamed, simulating real-world schema changes. It significantly outperforms direct training methods in these conditions.
  • TFT Pipeline: Utilizes a sophisticated training pipeline involving trace filtering, committee relabeling by multiple LLMs (e.g., openai.gpt-oss-120b, zai.glm-5), and synthetic data generation to enhance learning from corrupted or noisy production traces.
  • LoRA Fine-tuning: The model was fine-tuned using LoRA, with merged weights, on a synthetic dataset derived from the Schema-Guided Dialogue (SGD) dataset.
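The schema-drift scenario above can be pictured as a transformation over OpenAI-style tool definitions. The sketch below is illustrative only: the benchmark's actual renaming procedure is not published here, so `drift_schema`, the suffix scheme, and the tool definitions are assumptions used to show how function and parameter names might be randomly renamed while the schema structure stays intact.

```python
import copy
import random

def drift_schema(tools, seed=0):
    """Simulate schema drift: append a random version suffix to every
    function name and parameter name in a list of OpenAI-style tool
    definitions. The originals are left untouched (deep copy)."""
    rng = random.Random(seed)
    drifted = copy.deepcopy(tools)
    for tool in drifted:
        fn = tool["function"]
        suffix = f"_v{rng.randint(2, 9)}"  # hypothetical renaming scheme
        fn["name"] += suffix
        params = fn["parameters"]
        # Rename every parameter consistently with the same suffix.
        params["properties"] = {
            name + suffix: spec for name, spec in params["properties"].items()
        }
        params["required"] = [name + suffix for name in params.get("required", [])]
    return drifted

# Example tool definition mirroring the benchmark's restaurant domain.
tools = [{
    "type": "function",
    "function": {
        "name": "FindRestaurants",
        "parameters": {
            "type": "object",
            "properties": {
                "city": {"type": "string"},
                "cuisine": {"type": "string"},
            },
            "required": ["city"],
        },
    },
}]

drifted = drift_schema(tools)
```

A model trained only on the original names would emit `FindRestaurants(city=...)` and miss the renamed schema entirely, which is why robustness to this transformation is evaluated separately.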

Performance Highlights

  • Achieved an LLM-as-a-judge score of 0.844 and a staged_tool_call score of 0.748 in the S3 Schema Drift scenario.
  • In benchmark comparisons, the TFT pipeline, which this model uses, consistently outperforms direct training by 12-26 percentage points across various corrupted data scenarios, while matching performance on clean data.

Use Cases

This model is well-suited to applications that need reliable multi-turn tool calling in environments where the underlying API schemas may evolve or be inconsistent, making it a practical choice for robust conversational AI and agentic systems.
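A multi-turn tool-calling exchange of the kind described above can be sketched as a message trace. The format below is a hedged assumption (OpenAI-style `tool_calls` messages, with the final plain-text assistant turn standing in for `respond_to_user`); the benchmark's actual trace format may differ, though the `FindRestaurants`/`ReserveRestaurant` tool names come from the model card itself.

```python
import json

# Illustrative two-tool exchange: search, then reserve, then respond.
messages = [
    {"role": "user",
     "content": "Book me an Italian place in Oakland for 7pm."},
    {"role": "assistant", "tool_calls": [{
        "id": "call_1", "type": "function",
        "function": {"name": "FindRestaurants",
                     "arguments": json.dumps({"city": "Oakland",
                                              "cuisine": "Italian"})}}]},
    {"role": "tool", "tool_call_id": "call_1",
     "content": json.dumps([{"name": "Trattoria Rosa", "rating": 4.6}])},
    {"role": "assistant", "tool_calls": [{
        "id": "call_2", "type": "function",
        "function": {"name": "ReserveRestaurant",
                     "arguments": json.dumps({"restaurant": "Trattoria Rosa",
                                              "time": "19:00"})}}]},
    {"role": "tool", "tool_call_id": "call_2",
     "content": json.dumps({"status": "confirmed"})},
    # Final user-facing turn (the respond_to_user step).
    {"role": "assistant",
     "content": "Your table at Trattoria Rosa is booked for 7pm."},
]

def called_tools(msgs):
    """Return the ordered list of tool names the assistant invoked."""
    return [tc["function"]["name"]
            for m in msgs if m["role"] == "assistant"
            for tc in m.get("tool_calls", [])]
```

Evaluating a model on traces like this checks both that the right tools are called in the right order and that the final response correctly summarizes the tool results.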