distillabs/tft-benchmark-s3-direct-Qwen3-1.7B

TEXT GENERATION · Concurrency Cost: 1 · Model Size: 2B · Quant: BF16 · Ctx Length: 32k · Published: Apr 15, 2026 · License: apache-2.0 · Architecture: Transformer · Open Weights · Cold

The distillabs/tft-benchmark-s3-direct-Qwen3-1.7B is a 1.7 billion parameter Qwen3 model, fine-tuned by Distil Labs for multi-turn tool calling within the TFT Benchmark. This model specifically addresses the 'S3 Schema Drift' scenario, where function and parameter names are randomly renamed, achieving an LLM-as-a-judge score of 0.585. It is designed to evaluate the 'Direct Training' approach, which involves fine-tuning directly on raw production traces without filtering or synthetic data generation. This model is optimized for understanding and executing tool calls in complex, schema-drifted conversational environments.


Model Overview

This model, tft-benchmark-s3-direct-Qwen3-1.7B, is a Qwen3-1.7B variant developed by Distil Labs. It has been fine-tuned for multi-turn tool calling as part of the TFT (Training from Traces) Benchmark. This specific model represents the "Direct Training" pipeline, where it is fine-tuned directly on raw production traces without any filtering, relabeling, or synthetic data generation.

Key Capabilities & Performance

  • Multi-turn tool calling: Handles conversational scenarios that require sequential tool use across turns.
  • Schema Drift Handling: Specifically trained and evaluated on a scenario where function and parameter names are randomly renamed, simulating schema drift in production data.
  • Benchmark Scores: Achieved an LLM-as-a-judge score of 0.585 and a staged_tool_call score of 0.499 in the S3 Schema Drift scenario.
  • Target Tools: Capable of using tools like respond_to_user, FindRestaurants, and ReserveRestaurant based on the Schema-Guided Dialogue (SGD) dataset.
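The schema-drift corruption described above can be sketched as a transformation over an OpenAI-style tool definition. The `FindRestaurants` schema below is a simplified illustration based on SGD, and the renaming scheme (version suffixes on function names, numeric suffixes on parameter names) is an assumption for demonstration, not the benchmark's exact implementation:

```python
import copy
import random

def drift_schema(tool, rng):
    """Return a copy of an OpenAI-style tool definition with the
    function name and parameter names randomly renamed, simulating
    the S3 schema-drift corruption."""
    drifted = copy.deepcopy(tool)
    fn = drifted["function"]
    fn["name"] = f'{fn["name"]}_v{rng.randint(2, 9)}'
    props = fn["parameters"]["properties"]
    fn["parameters"]["properties"] = {
        f"{name}_{rng.randint(100, 999)}": spec for name, spec in props.items()
    }
    return drifted

# Illustrative FindRestaurants schema, simplified from SGD.
find_restaurants = {
    "type": "function",
    "function": {
        "name": "FindRestaurants",
        "parameters": {
            "type": "object",
            "properties": {
                "city": {"type": "string"},
                "cuisine": {"type": "string"},
            },
        },
    },
}

rng = random.Random(0)
drifted = drift_schema(find_restaurants, rng)
print(drifted["function"]["name"])  # original name plus a random version suffix
```

A model trained only on the original names must now map user intent onto the renamed functions and parameters, which is exactly what the S3 scenario stresses.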

Training Details

The model was fine-tuned using LoRA on a base Qwen3-1.7B model. It was trained to perform multi-turn tool calling in a closed-book setting, directly on raw production traces expanded into per-turn training examples. This model's performance is contrasted with the "TFT Pipeline" in the benchmark, which includes trace filtering, committee relabeling, and synthetic data generation. The benchmark highlights that the TFT pipeline significantly outperforms Direct Training in corrupted data scenarios, including schema drift.
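The expansion of raw traces into per-turn training examples can be sketched as follows: each assistant turn becomes one example whose input is the conversation prefix and whose target is that turn's output (text or tool call). The trace format and field names here are assumptions for illustration, not the benchmark's actual data schema:

```python
def expand_trace(trace):
    """Expand one multi-turn trace into per-turn training examples:
    for each assistant turn, the input is the conversation prefix
    and the target is that turn's content (text or tool call)."""
    examples = []
    for i, turn in enumerate(trace):
        if turn["role"] == "assistant":
            examples.append({"prompt": trace[:i], "target": turn})
    return examples

# Hypothetical raw production trace using the SGD-derived tools.
trace = [
    {"role": "user", "content": "Find me an Italian place in Oakland."},
    {"role": "assistant", "content": "FindRestaurants(city='Oakland', cuisine='Italian')"},
    {"role": "user", "content": "Book a table for two at 7pm."},
    {"role": "assistant", "content": "ReserveRestaurant(city='Oakland', party_size=2, time='7pm')"},
]

examples = expand_trace(trace)
print(len(examples))  # 2: one example per assistant turn
```

Because Direct Training applies this expansion to unfiltered traces, any noise or corruption in the assistant turns flows straight into the training targets.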

Use Cases

This model is particularly useful for researchers and developers evaluating the effectiveness of different fine-tuning strategies for tool-calling LLMs, especially when dealing with noisy or schema-drifted production data. It serves as a baseline for understanding the challenges of direct training on raw traces compared to more sophisticated data preparation pipelines.
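For researchers using the model as a baseline, a staged tool-call metric might be approximated as the fraction of turns where the predicted call matches the reference on function name and arguments. This scoring logic is a hypothetical illustration, not the benchmark's published implementation:

```python
def staged_tool_call_score(predicted, reference):
    """Per-turn exact match of (function name, arguments), averaged
    over the reference turns. A hypothetical stand-in for the
    benchmark's staged_tool_call metric."""
    if not reference:
        return 0.0
    hits = sum(
        1
        for pred, ref in zip(predicted, reference)
        if pred.get("name") == ref.get("name")
        and pred.get("arguments") == ref.get("arguments")
    )
    return hits / len(reference)

reference = [
    {"name": "FindRestaurants", "arguments": {"city": "Oakland"}},
    {"name": "ReserveRestaurant", "arguments": {"party_size": 2}},
]
predicted = [
    {"name": "FindRestaurants", "arguments": {"city": "Oakland"}},
    {"name": "ReserveRestaurant_v3", "arguments": {"party_size": 2}},  # drifted name misses
]
print(staged_tool_call_score(predicted, reference))  # 0.5
```

Under a strict metric like this, a single drifted function name zeroes out the turn even when the arguments are correct, which helps explain why schema drift is so punishing for models trained directly on raw traces.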