distillabs/tft-benchmark-s4-direct-Qwen3-1.7B

Text generation · Concurrency cost: 1 · Model size: 2B · Quantization: BF16 · Context length: 32k · Published: Apr 15, 2026 · License: apache-2.0 · Architecture: Transformer · Open weights · Cold

distillabs/tft-benchmark-s4-direct-Qwen3-1.7B is a 1.7-billion-parameter Qwen3 model, fine-tuned for multi-turn tool calling as part of the TFT (Training from Traces) Benchmark. It was trained directly on raw production traces under a low-data scenario (S4), achieving an LLM-as-a-judge score of 0.649. The model is designed for evaluating direct training approaches in tool-calling tasks, particularly restaurant search and reservation functions.


Model Overview

This model, tft-benchmark-s4-direct-Qwen3-1.7B, is a Qwen3-1.7B variant specifically fine-tuned for multi-turn tool calling. It is a component of the TFT (Training from Traces) Benchmark, which evaluates different approaches to training Small Language Models (SLMs) from production traces.
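As a rough usage sketch, the model can be queried through the standard Qwen3 chat interface in Hugging Face `transformers`. This is an illustrative assumption based on the Qwen3 base model, not an interface documented in the model card; generation settings are likewise illustrative.

```python
# Hedged sketch: assumes the standard Qwen3 chat template works unchanged
# for this fine-tune; the model card does not document an inference recipe.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "distillabs/tft-benchmark-s4-direct-Qwen3-1.7B"


def build_messages(user_turn: str) -> list[dict]:
    # Minimal single-turn conversation; the benchmark itself evaluates
    # multi-turn tool-calling traces.
    return [{"role": "user", "content": user_turn}]


def generate_reply(user_turn: str, max_new_tokens: int = 256) -> str:
    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto")
    prompt = tokenizer.apply_chat_template(
        build_messages(user_turn), tokenize=False, add_generation_prompt=True
    )
    inputs = tokenizer(prompt, return_tensors="pt")
    out = model.generate(**inputs, max_new_tokens=max_new_tokens)
    # Decode only the newly generated tokens.
    return tokenizer.decode(
        out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
    )


if __name__ == "__main__":
    print(build_messages("Find me an Italian restaurant in San Jose."))
```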

Key Characteristics

  • Base Model: Qwen3-1.7B, a 1.7 billion parameter model.
  • Training Pipeline: Utilizes a "Direct Training" approach, meaning it was fine-tuned directly on raw, expanded production traces without filtering, relabeling, or synthetic data generation.
  • Scenario: Trained under the "S4 Low Data" scenario, using only 5 clean Restaurants_1 traces, representing extreme data scarcity.
  • Performance: Achieved an LLM-as-a-judge score of 0.649 and a staged_tool_call score of 0.66 on a held-out test set of 34 multi-turn conversations.
  • Target Tools: Designed to interact with tools for restaurant search (FindRestaurants) and reservation (ReserveRestaurant), based on the Schema-Guided Dialogue (SGD) dataset.
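The two target tools above can be sketched as OpenAI-style JSON function schemas. The parameter names and types below (`city`, `cuisine`, `time`, and so on) are illustrative assumptions modeled on the SGD Restaurants_1 service, not the exact schemas used in the benchmark.

```python
import json

# Hypothetical schemas for the model's two target tools; field names are
# assumptions based on the SGD Restaurants_1 domain, not the training data.
FIND_RESTAURANTS = {
    "type": "function",
    "function": {
        "name": "FindRestaurants",
        "description": "Search for restaurants by location and cuisine.",
        "parameters": {
            "type": "object",
            "properties": {
                "city": {"type": "string", "description": "City to search in."},
                "cuisine": {"type": "string", "description": "Type of food."},
            },
            "required": ["city", "cuisine"],
        },
    },
}

RESERVE_RESTAURANT = {
    "type": "function",
    "function": {
        "name": "ReserveRestaurant",
        "description": "Book a table at a specific restaurant.",
        "parameters": {
            "type": "object",
            "properties": {
                "restaurant_name": {"type": "string"},
                "city": {"type": "string"},
                "time": {"type": "string", "description": "e.g. '19:00'."},
                "party_size": {"type": "integer"},
            },
            "required": ["restaurant_name", "city", "time"],
        },
    },
}

TOOLS = [FIND_RESTAURANTS, RESERVE_RESTAURANT]
print(json.dumps([t["function"]["name"] for t in TOOLS]))
```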

When to Use This Model

This model is particularly relevant for researchers and developers interested in:

  • Benchmarking Direct Training: Understanding the performance of direct fine-tuning on raw, potentially noisy, and scarce production traces for tool-calling tasks.
  • Low-Data Scenarios: Evaluating model behavior and limitations when trained with very limited clean data.
  • Comparison with the TFT Pipeline: Contrasting its performance against models trained with the full TFT pipeline (trace filtering, relabeling, and synthetic data generation), which significantly outperforms direct training in corrupted scenarios (e.g., +20.3pp in S4 Low Data).
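To make the "+20.3pp" figure concrete: only the direct-training S4 judge score (0.649) and the percentage-point gap are stated here, so the TFT-pipeline score implied below is derived arithmetic, not a number reported in the card.

```python
# "pp" means percentage points: an absolute difference between two scores.
direct_s4_judge = 0.649          # reported LLM-as-a-judge score (direct, S4)
tft_gap_pp = 20.3                # reported gap, TFT pipeline over direct in S4

# Derived, not reported: the judge score the gap implies for the TFT pipeline.
implied_tft_s4_judge = direct_s4_judge + tft_gap_pp / 100
print(f"{implied_tft_s4_judge:.3f}")  # prints 0.852
```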