distillabs/tft-benchmark-s2-tft-Qwen3-1.7B

Text Generation | Concurrency Cost: 1 | Model Size: 2B | Quant: BF16 | Ctx Length: 32k | Published: Apr 15, 2026 | License: apache-2.0 | Architecture: Transformer | Open Weights | Cold

distillabs/tft-benchmark-s2-tft-Qwen3-1.7B is a Qwen3-1.7B model (listed as 2B parameters) fine-tuned by Distil Labs for multi-turn tool calling. It was developed as part of the TFT (Training from Traces) Benchmark, specifically for the noisy-labels scenario. It achieves an LLM-as-a-judge score of 0.844 when 50% of the assistant tool calls in the training traces are corrupted, a reported 12.3 percentage point improvement over direct training in that scenario.


Model Overview

distillabs/tft-benchmark-s2-tft-Qwen3-1.7B is a Qwen3-1.7B model (listed as 2B parameters) developed by Distil Labs and fine-tuned for multi-turn tool calling as part of the TFT (Training from Traces) Benchmark. It addresses the problem of training Small Language Models (SLMs) from production traces when those traces contain noisy or corrupted data.

Key Capabilities & Differentiators

  • Robust Tool Calling: Achieves an LLM-as-a-judge score of 0.844 and a staged_tool_call score of 0.758 in the S2 Noisy Labels scenario, where 50% of assistant tool calls are corrupted.
  • TFT Pipeline Advantage: Trained with the TFT pipeline, which consists of trace filtering, committee relabeling, synthetic data generation, and fine-tuning. In the S2 Noisy Labels scenario, this pipeline shows a +12.3 percentage point improvement over direct training on the raw traces.
  • Targeted Tool Use: Optimized for restaurant search and reservation tools, including respond_to_user, FindRestaurants, and ReserveRestaurant, based on the Schema-Guided Dialogue (SGD) dataset.
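
To make the tool-calling setup concrete, the following is a minimal sketch of how an application might check a tool call emitted by the model before executing it. The three tool names come from the list above; the required parameter names (message, city, cuisine, restaurant_name, time) and the JSON layout with "name" and "arguments" keys are illustrative assumptions, not the benchmark's actual schema or output format.

```python
import json

# Tool names are taken from the model card. The required parameters are
# hypothetical placeholders, not the real SGD-derived schema.
TOOLS = {
    "respond_to_user": {"required": ["message"]},
    "FindRestaurants": {"required": ["city", "cuisine"]},
    "ReserveRestaurant": {"required": ["restaurant_name", "city", "time"]},
}


def validate_tool_call(raw: str) -> tuple[bool, str]:
    """Return (is_valid, reason) for a JSON-encoded tool call."""
    try:
        call = json.loads(raw)
    except json.JSONDecodeError as exc:
        return False, f"invalid JSON: {exc.msg}"
    if not isinstance(call, dict):
        return False, "tool call must be a JSON object"

    name = call.get("name")
    if not isinstance(name, str) or name not in TOOLS:
        return False, f"unknown tool: {name}"

    args = call.get("arguments", {})
    if not isinstance(args, dict):
        return False, "arguments must be a JSON object"

    # Report every required parameter that the call left out.
    missing = [p for p in TOOLS[name]["required"] if p not in args]
    if missing:
        return False, "missing arguments: " + ", ".join(missing)
    return True, "ok"


if __name__ == "__main__":
    example = '{"name": "FindRestaurants", "arguments": {"city": "San Jose", "cuisine": "Thai"}}'
    print(validate_tool_call(example))
```

A check like this is useful in front of any small tool-calling model, since a rejected call can be fed back to the model as an error message instead of being executed.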

When to Use This Model

This model is suited to applications that need reliable multi-turn tool calling when the available training data is noisy or corrupted. Because it was trained from imperfect production traces, it is a reasonable starting point for conversational agents that call external tools for tasks such as restaurant search and booking.
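
As an illustration of what handling imperfect traces can involve, the sketch below relabels noisy tool calls by majority vote and drops traces where the voters disagree. This is only one plausible reading of "committee relabeling"; the model card does not describe the actual TFT algorithm, and the function names, the trace fields (context, committee_votes), and the 0.6 agreement threshold are all assumptions made for this example.

```python
from collections import Counter
from typing import Optional


def committee_label(votes: list, min_agreement: float = 0.6) -> Optional[str]:
    """Return the majority label, or None if agreement is below the threshold."""
    if not votes:
        return None
    label, count = Counter(votes).most_common(1)[0]
    return label if count / len(votes) >= min_agreement else None


def clean_traces(traces: list, min_agreement: float = 0.6) -> list:
    """Keep traces with a clear committee consensus and attach the agreed label.

    Each trace is a dict with hypothetical keys "context" and "committee_votes".
    """
    cleaned = []
    for trace in traces:
        label = committee_label(trace["committee_votes"], min_agreement)
        if label is not None:
            cleaned.append({"context": trace["context"], "tool_call": label})
    return cleaned


if __name__ == "__main__":
    traces = [
        {"context": "Find me Thai food in San Jose",
         "committee_votes": ["FindRestaurants", "FindRestaurants", "ReserveRestaurant"]},
        {"context": "Hmm, not sure yet",
         "committee_votes": ["respond_to_user", "FindRestaurants", "ReserveRestaurant"]},
    ]
    for row in clean_traces(traces):
        print(row)
```

In this sketch the first trace is kept with the label FindRestaurants (two of three votes), while the second is dropped because no label reaches the threshold.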