distillabs/tft-benchmark-s4-direct-Qwen3-1.7B
distillabs/tft-benchmark-s4-direct-Qwen3-1.7B is a 1.7-billion-parameter Qwen3 model fine-tuned for multi-turn tool calling within the TFT (Training from Traces) Benchmark. It was trained directly on raw production traces under a low-data scenario (S4) and achieves an LLM-as-a-judge score of 0.649. The model is intended for evaluating direct training approaches on tool-calling tasks, particularly restaurant search and reservation functions.
Model Overview
This model, tft-benchmark-s4-direct-Qwen3-1.7B, is a Qwen3-1.7B variant specifically fine-tuned for multi-turn tool calling. It is a component of the TFT (Training from Traces) Benchmark, which evaluates different approaches to training Small Language Models (SLMs) from production traces.
Key Characteristics
- Base Model: Qwen3-1.7B (1.7 billion parameters).
- Training Pipeline: Utilizes a "Direct Training" approach, meaning it was fine-tuned directly on raw, expanded production traces without filtering, relabeling, or synthetic data generation.
- Scenario: Trained under the "S4 Low Data" scenario, using only 5 clean `Restaurants_1` traces, representing extreme data scarcity.
- Performance: Achieved an LLM-as-a-judge score of 0.649 and a `staged_tool_call` score of 0.66 on a held-out test set of 34 multi-turn conversations.
- Target Tools: Designed to interact with tools for restaurant search (`FindRestaurants`) and reservation (`ReserveRestaurant`), based on the Schema-Guided Dialogue (SGD) dataset; see the usage sketch after this list.
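As a rough illustration of how the model is meant to be driven, the sketch below loads the checkpoint with the standard Hugging Face transformers API and passes JSON tool schemas through the chat template. It assumes the checkpoint keeps the Qwen3 base model's chat template and tool-call format; the `FindRestaurants`/`ReserveRestaurant` parameter names are illustrative guesses modeled on the SGD `Restaurants_1` service, not the exact schemas used in training.

```python
# Minimal sketch: prompting this checkpoint for a tool call.
# Assumes Qwen3's chat template with tool support (transformers >= 4.42);
# the tool schemas below are illustrative, not the training schemas.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "distillabs/tft-benchmark-s4-direct-Qwen3-1.7B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

# Hypothetical schemas based on the SGD Restaurants_1 service.
tools = [
    {
        "type": "function",
        "function": {
            "name": "FindRestaurants",
            "description": "Search for restaurants in a given city.",
            "parameters": {
                "type": "object",
                "properties": {
                    "city": {"type": "string"},
                    "cuisine": {"type": "string"},
                },
                "required": ["city", "cuisine"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "ReserveRestaurant",
            "description": "Book a table at a restaurant.",
            "parameters": {
                "type": "object",
                "properties": {
                    "restaurant_name": {"type": "string"},
                    "city": {"type": "string"},
                    "time": {"type": "string"},
                    "party_size": {"type": "integer"},
                },
                "required": ["restaurant_name", "city", "time"],
            },
        },
    },
]

messages = [
    {"role": "user", "content": "Find me an Italian place in San Jose for tonight."}
]

# apply_chat_template renders the tool schemas into the prompt so the
# model can emit a structured tool call in its reply.
inputs = tokenizer.apply_chat_template(
    messages,
    tools=tools,
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```

For the multi-turn setting the benchmark targets, the caller would parse the emitted tool call, execute it, append the result as a tool-role message, and generate again, repeating until the model replies with plain text instead of another call.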
When to Use This Model
This model is particularly relevant for researchers and developers interested in:
- Benchmarking Direct Training: Understanding the performance of direct fine-tuning on raw, potentially noisy, and scarce production traces for tool-calling tasks.
- Low-Data Scenarios: Evaluating model behavior and limitations when trained with very limited clean data.
- Comparison with the TFT Pipeline: Comparing its performance against models trained with the full TFT pipeline (trace filtering, relabeling, and synthetic data generation), which significantly outperforms direct training in corrupted and low-data scenarios (e.g., +20.3pp in S4 Low Data).