distillabs/tft-benchmark-s1-direct-Qwen3-1.7B

Text Generation · Concurrency Cost: 1 · Model Size: 2B · Quant: BF16 · Ctx Length: 32k · Published: Apr 15, 2026 · License: apache-2.0 · Architecture: Transformer · Open Weights

distillabs/tft-benchmark-s1-direct-Qwen3-1.7B is a 1.7-billion-parameter Qwen3-based model fine-tuned by Distil Labs for multi-turn tool calling. It was trained with a direct approach on clean production traces as part of the TFT Benchmark, achieving an LLM-as-a-judge score of 0.864. The model targets structured tool use, such as restaurant search and reservation, and performs on par with more complex training pipelines when the training data is uncorrupted.


Model Overview

distillabs/tft-benchmark-s1-direct-Qwen3-1.7B is a 1.7-billion-parameter Qwen3 variant fine-tuned by Distil Labs for multi-turn tool calling. It is part of the TFT (Training from Traces) Benchmark, which compares methods for training Small Language Models (SLMs) from production traces.

Key Characteristics

  • Base Model: Qwen3-1.7B, a 1.7-billion-parameter language model.
  • Training Method: Utilizes "Direct Training," where the model is fine-tuned directly on raw production traces without additional filtering, relabeling, or synthetic data generation.
  • Benchmark Scenario: Evaluated in the S1 Baseline scenario, which uses 327 clean Restaurants_1 traces, representing a high-quality data environment.
  • Performance: Achieved an LLM-as-a-judge score of 0.864 and a staged_tool_call score of 0.787 on the S1 Baseline, indicating strong performance on clean data.
  • Target Tools: Designed to interact with tools for restaurant search (FindRestaurants), reservation (ReserveRestaurant), and user responses (respond_to_user), based on the Schema-Guided Dialogue (SGD) dataset.
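The three tools listed above can be exposed to the model as JSON schemas in the common OpenAI-style function-calling format that Hugging Face chat templates accept. A minimal sketch follows; the slot names (`city`, `cuisine`, `time`, etc.) are illustrative guesses in the spirit of the SGD Restaurants_1 schema, not values confirmed by this model card:

```python
# Hypothetical tool schemas for FindRestaurants, ReserveRestaurant, and
# respond_to_user. Parameter names are assumptions for illustration only.
TOOLS = [
    {
        "type": "function",
        "function": {
            "name": "FindRestaurants",
            "description": "Search for restaurants matching the user's criteria.",
            "parameters": {
                "type": "object",
                "properties": {
                    "city": {"type": "string"},
                    "cuisine": {"type": "string"},
                },
                "required": ["city", "cuisine"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "ReserveRestaurant",
            "description": "Book a table at a specific restaurant.",
            "parameters": {
                "type": "object",
                "properties": {
                    "restaurant_name": {"type": "string"},
                    "city": {"type": "string"},
                    "time": {"type": "string"},
                },
                "required": ["restaurant_name", "city", "time"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "respond_to_user",
            "description": "Send a natural-language reply to the user.",
            "parameters": {
                "type": "object",
                "properties": {"message": {"type": "string"}},
                "required": ["message"],
            },
        },
    },
]

print([t["function"]["name"] for t in TOOLS])
```

A list like `TOOLS` would typically be passed via the `tools` argument of the tokenizer's `apply_chat_template` so the schemas are rendered into the model's system prompt.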

Use Case and Differentiation

This model is well suited to applications requiring multi-turn tool calling when the training data is clean and uncorrupted. Its direct training approach makes it the baseline for comparison against more complex pipelines such as the TFT Pipeline, which adds trace filtering and synthetic data generation. While it performs comparably to the TFT Pipeline on clean data (S1 Baseline), the TFT Pipeline shows sizable advantages (12–26 percentage points) under noisy labels, schema drift, or low data availability. This model is therefore best for use cases where training-data quality is consistently high, offering a straightforward and effective solution for structured dialogue and tool interaction.
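In a deployed multi-turn loop, the application must parse the model's tool-call output and route it to the right backend function. Qwen-family chat templates conventionally wrap tool calls in `<tool_call>` tags containing a JSON object; the sketch below assumes that format, and the stub tool and its return value are hypothetical:

```python
import json
import re

# Stub implementation; a real version would query a restaurant search backend.
def find_restaurants(city, cuisine):
    return [{"restaurant_name": "Demo Bistro", "city": city, "cuisine": cuisine}]

TOOL_REGISTRY = {"FindRestaurants": find_restaurants}

def dispatch(model_output: str):
    """Extract the first <tool_call> block and invoke the named tool.

    Returns None when the model produced a plain-text answer instead.
    """
    match = re.search(r"<tool_call>\s*(\{.*?\})\s*</tool_call>",
                      model_output, re.DOTALL)
    if match is None:
        return None
    call = json.loads(match.group(1))
    return TOOL_REGISTRY[call["name"]](**call["arguments"])

# Simulated model output in the assumed Qwen-style tool-call format.
output = (
    "<tool_call>\n"
    '{"name": "FindRestaurants", "arguments": '
    '{"city": "San Jose", "cuisine": "Italian"}}\n'
    "</tool_call>"
)
print(dispatch(output))
```

The tool's return value would then be appended to the conversation as a tool-role message before the next generation turn.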