distillabs/tft-benchmark-s5-direct-Qwen3-1.7B
The distillabs/tft-benchmark-s5-direct-Qwen3-1.7B is a 1.7 billion parameter Qwen3 model developed by Distil Labs, fine-tuned for multi-turn tool calling. It was trained using a 'Direct Training' pipeline on mixed production traces for the TFT benchmark, specifically for restaurant search and reservation tasks. This model serves as a baseline for evaluating the effectiveness of direct training on raw, potentially corrupted, production data compared to more refined training approaches.
Overview
This model, tft-benchmark-s5-direct-Qwen3-1.7B, is a 1.7 billion parameter Qwen3 model developed by Distil Labs. It is specifically fine-tuned for multi-turn tool calling within the context of the TFT (Training from Traces) Benchmark. The model was trained using a 'Direct Training' pipeline, meaning it was fine-tuned directly on raw production traces without filtering, relabeling, or synthetic data generation.
Key Characteristics & Performance
- Base Model: Qwen3-1.7B.
- Training Method: LoRA fine-tuning on raw production traces, with merged weights.
- Scenario: S5 Trace Mixing, involving 80% Hotels_1 and 20% Restaurants_1 traces, with shuffled message order and renamed function names.
- Evaluation: Achieved an LLM-as-a-judge score of 0.694 and a staged_tool_call score of 0.74 on a held-out test set of multi-turn Restaurants_1 conversations.
- Target Tools: Designed to handle `respond_to_user`, `FindRestaurants` (by cuisine, city, price, music, alcohol), and `ReserveRestaurant` (by name, city, time, date, party size), based on the Schema-Guided Dialogue (SGD) dataset.
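The tool interface above can be sketched as JSON-style function schemas of the kind typically passed to a chat template's `tools` argument. This is a minimal sketch: the parameter names and types below are assumptions inferred from the SGD slot descriptions, not the schema the model was actually trained on.

```python
# Hedged sketch of the two SGD-derived tools. Parameter names and types are
# assumptions based on the slots listed above, not the model's actual schema.
FIND_RESTAURANTS = {
    "type": "function",
    "function": {
        "name": "FindRestaurants",
        "description": "Search for restaurants in a city.",
        "parameters": {
            "type": "object",
            "properties": {
                "cuisine": {"type": "string"},
                "city": {"type": "string"},
                "price_range": {"type": "string"},
                "has_live_music": {"type": "boolean"},
                "serves_alcohol": {"type": "boolean"},
            },
            "required": ["cuisine", "city"],
        },
    },
}

RESERVE_RESTAURANT = {
    "type": "function",
    "function": {
        "name": "ReserveRestaurant",
        "description": "Book a table at a restaurant.",
        "parameters": {
            "type": "object",
            "properties": {
                "restaurant_name": {"type": "string"},
                "city": {"type": "string"},
                "time": {"type": "string"},
                "date": {"type": "string"},
                "party_size": {"type": "integer"},
            },
            "required": ["restaurant_name", "city", "time", "date"],
        },
    },
}

TOOLS = [FIND_RESTAURANTS, RESERVE_RESTAURANT]
```

Schemas in this shape would typically be supplied to the model at inference time via `tokenizer.apply_chat_template(messages, tools=TOOLS, ...)`, which Qwen3-family chat templates support.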
Benchmark Context
This model is part of a larger benchmark comparing 'Direct Training' with a 'TFT Pipeline' (trace filtering, relabeling, synthetic data generation, finetuning). While 'Direct Training' performs comparably to the TFT pipeline on clean data, the TFT pipeline significantly outperforms it on corrupted scenarios, with a +16.4 percentage point advantage in the S5 Trace Mixing scenario. This model thus illustrates the challenges of training directly on noisy production data for tool-calling tasks.