distillabs/tft-benchmark-s5-direct-Qwen3-1.7B
The distillabs/tft-benchmark-s5-direct-Qwen3-1.7B is a 1.7 billion parameter Qwen3 model developed by Distil Labs, fine-tuned for multi-turn tool calling. It was trained using a 'Direct Training' pipeline on mixed production traces for the TFT benchmark, specifically for restaurant search and reservation tasks. This model serves as a baseline for evaluating the effectiveness of direct training on raw, potentially corrupted, production data compared to more refined training approaches.
Overview
This model, tft-benchmark-s5-direct-Qwen3-1.7B, is a 1.7 billion parameter Qwen3 model developed by Distil Labs. It is specifically fine-tuned for multi-turn tool calling within the context of the TFT (Training from Traces) Benchmark. The model was trained using a 'Direct Training' pipeline, meaning it was fine-tuned directly on raw production traces without filtering, relabeling, or synthetic data generation.
Key Characteristics & Performance
- Base Model: Qwen3-1.7B.
- Training Method: LoRA fine-tuning on raw production traces, with merged weights.
- Scenario: S5 Trace Mixing, involving 80% Hotels_1 and 20% Restaurants_1 traces, with shuffled message order and renamed function names.
- Evaluation: Achieved an LLM-as-a-judge score of 0.694 and a staged_tool_call score of 0.74 on a held-out test set of multi-turn Restaurants_1 conversations.
- Target Tools: Designed to handle `respond_to_user`, `FindRestaurants` (by cuisine, city, price, music, alcohol), and `ReserveRestaurant` (by name, city, time, date, party size), based on the Schema-Guided Dialogue (SGD) dataset.
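The tool interface above can be sketched as JSON-style function schemas of the kind typically passed to a chat template's `tools` argument. This is a minimal sketch: the parameter names and types below are assumptions inferred from the SGD slot descriptions, not the schema the model was actually trained on.

```python
# Hedged sketch of the two SGD-derived tools. Parameter names and types are
# assumptions based on the slots listed above, not the model's actual schema.
FIND_RESTAURANTS = {
    "type": "function",
    "function": {
        "name": "FindRestaurants",
        "description": "Search for restaurants in a city.",
        "parameters": {
            "type": "object",
            "properties": {
                "cuisine": {"type": "string"},
                "city": {"type": "string"},
                "price_range": {"type": "string"},
                "has_live_music": {"type": "boolean"},
                "serves_alcohol": {"type": "boolean"},
            },
            "required": ["cuisine", "city"],
        },
    },
}

RESERVE_RESTAURANT = {
    "type": "function",
    "function": {
        "name": "ReserveRestaurant",
        "description": "Book a table at a restaurant.",
        "parameters": {
            "type": "object",
            "properties": {
                "restaurant_name": {"type": "string"},
                "city": {"type": "string"},
                "time": {"type": "string"},
                "date": {"type": "string"},
                "party_size": {"type": "integer"},
            },
            "required": ["restaurant_name", "city", "time", "date"],
        },
    },
}

TOOLS = [FIND_RESTAURANTS, RESERVE_RESTAURANT]
```

Schemas in this shape would typically be supplied to the model at inference time via `tokenizer.apply_chat_template(messages, tools=TOOLS, ...)`, which Qwen3-family chat templates support.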
Benchmark Context
This model is part of a larger benchmark comparing 'Direct Training' with a 'TFT Pipeline' (trace filtering, relabeling, synthetic data generation, finetuning). While 'Direct Training' performs comparably to the TFT pipeline on clean data, the TFT pipeline significantly outperforms it on corrupted scenarios, with a +16.4 percentage point advantage in the S5 Trace Mixing scenario. This model thus illustrates the challenges of training directly on noisy production data for tool-calling tasks.