distillabs/tft-benchmark-s3-direct-Qwen3-1.7B

TEXT GENERATION · Concurrency Cost: 1 · Model Size: 2B · Quant: BF16 · Ctx Length: 32k · Published: Apr 15, 2026 · License: apache-2.0 · Architecture: Transformer · Open Weights · Cold

The distillabs/tft-benchmark-s3-direct-Qwen3-1.7B is a 1.7 billion parameter Qwen3 model, fine-tuned by Distil Labs for multi-turn tool calling within the TFT Benchmark. This model specifically addresses the 'S3 Schema Drift' scenario, where function and parameter names are randomly renamed, achieving an LLM-as-a-judge score of 0.585. It is designed to evaluate the 'Direct Training' approach, which involves fine-tuning directly on raw production traces without filtering or synthetic data generation. This model is optimized for understanding and executing tool calls in complex, schema-drifted conversational environments.


Model Overview

This model, tft-benchmark-s3-direct-Qwen3-1.7B, is a Qwen3-1.7B variant developed by Distil Labs. It has been fine-tuned for multi-turn tool calling as part of the TFT (Training from Traces) Benchmark. This specific model represents the "Direct Training" pipeline, where it is fine-tuned directly on raw production traces without any filtering, relabeling, or synthetic data generation.

Key Capabilities & Performance

  • Multi-turn tool calling: Handles conversational scenarios that require sequential tool use across turns.
  • Schema Drift Handling: Specifically trained and evaluated on a scenario where function and parameter names are randomly renamed, simulating schema drift in production data.
  • Benchmark Scores: Achieved an LLM-as-a-judge score of 0.585 and a staged_tool_call score of 0.499 in the S3 Schema Drift scenario.
  • Target Tools: Capable of using tools like respond_to_user, FindRestaurants, and ReserveRestaurant based on the Schema-Guided Dialogue (SGD) dataset.
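The schema-drift corruption described above can be sketched as a transformation over an OpenAI-style tool definition. The `FindRestaurants` schema below is a simplified illustration based on SGD, and the renaming scheme (version suffixes on function names, numeric suffixes on parameter names) is an assumption for demonstration, not the benchmark's exact implementation:

```python
import copy
import random

def drift_schema(tool, rng):
    """Return a copy of an OpenAI-style tool definition with the
    function name and parameter names randomly renamed, simulating
    the S3 schema-drift corruption."""
    drifted = copy.deepcopy(tool)
    fn = drifted["function"]
    fn["name"] = f'{fn["name"]}_v{rng.randint(2, 9)}'
    props = fn["parameters"]["properties"]
    fn["parameters"]["properties"] = {
        f"{name}_{rng.randint(100, 999)}": spec for name, spec in props.items()
    }
    return drifted

# Illustrative FindRestaurants schema, simplified from SGD.
find_restaurants = {
    "type": "function",
    "function": {
        "name": "FindRestaurants",
        "parameters": {
            "type": "object",
            "properties": {
                "city": {"type": "string"},
                "cuisine": {"type": "string"},
            },
        },
    },
}

rng = random.Random(0)
drifted = drift_schema(find_restaurants, rng)
print(drifted["function"]["name"])  # original name plus a random version suffix
```

A model trained only on the original names must now map user intent onto the renamed functions and parameters, which is exactly what the S3 scenario stresses.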

Training Details

The model was fine-tuned using LoRA on a base Qwen3-1.7B model. It was trained to perform multi-turn tool calling in a closed-book setting, directly on raw production traces expanded into per-turn training examples. This model's performance is contrasted with the "TFT Pipeline" in the benchmark, which includes trace filtering, committee relabeling, and synthetic data generation. The benchmark highlights that the TFT pipeline significantly outperforms Direct Training in corrupted data scenarios, including schema drift.
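The expansion of raw traces into per-turn training examples can be sketched as follows: each assistant turn becomes one example whose input is the conversation prefix and whose target is that turn's output (text or tool call). The trace format and field names here are assumptions for illustration, not the benchmark's actual data schema:

```python
def expand_trace(trace):
    """Expand one multi-turn trace into per-turn training examples:
    for each assistant turn, the input is the conversation prefix
    and the target is that turn's content (text or tool call)."""
    examples = []
    for i, turn in enumerate(trace):
        if turn["role"] == "assistant":
            examples.append({"prompt": trace[:i], "target": turn})
    return examples

# Hypothetical raw production trace using the SGD-derived tools.
trace = [
    {"role": "user", "content": "Find me an Italian place in Oakland."},
    {"role": "assistant", "content": "FindRestaurants(city='Oakland', cuisine='Italian')"},
    {"role": "user", "content": "Book a table for two at 7pm."},
    {"role": "assistant", "content": "ReserveRestaurant(city='Oakland', party_size=2, time='7pm')"},
]

examples = expand_trace(trace)
print(len(examples))  # 2: one example per assistant turn
```

Because Direct Training applies this expansion to unfiltered traces, any noise or corruption in the assistant turns flows straight into the training targets.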

Use Cases

This model is particularly useful for researchers and developers evaluating the effectiveness of different fine-tuning strategies for tool-calling LLMs, especially when dealing with noisy or schema-drifted production data. It serves as a baseline for understanding the challenges of direct training on raw traces compared to more sophisticated data preparation pipelines.
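For researchers using the model as a baseline, a staged tool-call metric might be approximated as the fraction of turns where the predicted call matches the reference on function name and arguments. This scoring logic is a hypothetical illustration, not the benchmark's published implementation:

```python
def staged_tool_call_score(predicted, reference):
    """Per-turn exact match of (function name, arguments), averaged
    over the reference turns. A hypothetical stand-in for the
    benchmark's staged_tool_call metric."""
    if not reference:
        return 0.0
    hits = sum(
        1
        for pred, ref in zip(predicted, reference)
        if pred.get("name") == ref.get("name")
        and pred.get("arguments") == ref.get("arguments")
    )
    return hits / len(reference)

reference = [
    {"name": "FindRestaurants", "arguments": {"city": "Oakland"}},
    {"name": "ReserveRestaurant", "arguments": {"party_size": 2}},
]
predicted = [
    {"name": "FindRestaurants", "arguments": {"city": "Oakland"}},
    {"name": "ReserveRestaurant_v3", "arguments": {"party_size": 2}},  # drifted name misses
]
print(staged_tool_call_score(predicted, reference))  # 0.5
```

Under a strict metric like this, a single drifted function name zeroes out the turn even when the arguments are correct, which helps explain why schema drift is so punishing for models trained directly on raw traces.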