distillabs/tft-benchmark-s2-direct-Qwen3-1.7B
distillabs/tft-benchmark-s2-direct-Qwen3-1.7B is a 1.7-billion-parameter Qwen3 model fine-tuned by distillabs for multi-turn tool calling. It was trained directly on raw, noisy production traces as part of the TFT Benchmark, specifically for the S2 Noisy Labels scenario, and is intended to evaluate direct training approaches for small language models in tool-use contexts. It achieves an LLM-as-a-judge score of 0.721 and a staged_tool_call score of 0.731.
Model Overview
This model, tft-benchmark-s2-direct-Qwen3-1.7B, is a 1.7 billion parameter Qwen3 base model fine-tuned by distillabs for multi-turn tool calling. It is a component of the TFT (Training from Traces) Benchmark, which evaluates different approaches to training Small Language Models (SLMs) from production traces.
Key Characteristics
- Base Model: Qwen3-1.7B
- Training Pipeline: Direct Training, meaning it was fine-tuned directly on raw/corrupted traces without filtering, relabeling, or synthetic data generation.
- Scenario: Specifically trained for the S2 Noisy Labels scenario, which comprises 327 Restaurants_1 traces with 50% of assistant tool calls corrupted, focusing on tool timing errors.
- Performance: Achieved an LLM-as-a-judge score of 0.721 and a staged_tool_call score of 0.731 in this benchmark scenario.
- Target Tools: Designed to handle the `respond_to_user`, `FindRestaurants`, and `ReserveRestaurant` tools, based on the Schema-Guided Dialogue (SGD) dataset.
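To illustrate the structure a tool call to these three tools implies, the sketch below validates a model-emitted JSON tool call against minimal per-tool schemas. The required-argument lists here are assumptions for illustration, not the actual SGD Restaurants_1 schemas.

```python
import json

# Assumed required arguments for each target tool; the real SGD
# Restaurants_1 schemas define more fields than shown here.
TOOLS = {
    "respond_to_user": [],
    "FindRestaurants": ["city", "cuisine"],
    "ReserveRestaurant": ["restaurant_name", "city", "time"],
}


def validate_tool_call(raw: str) -> bool:
    """Return True if a JSON tool call names a known tool and
    supplies all of that tool's required arguments."""
    try:
        call = json.loads(raw)
    except json.JSONDecodeError:
        return False
    if call.get("name") not in TOOLS:
        return False
    args = call.get("arguments", {})
    return all(a in args for a in TOOLS[call["name"]])
```

For example, `validate_tool_call('{"name": "FindRestaurants", "arguments": {"city": "San Jose", "cuisine": "Italian"}}')` passes, while a call that omits `cuisine` or names an unknown tool does not.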
When to Consider This Model
This model is primarily a benchmark artifact, demonstrating the performance of direct training on noisy data for tool-calling tasks. It serves as a baseline for comparison against more sophisticated training pipelines like the TFT Pipeline, which significantly outperforms direct training in corrupted scenarios (e.g., +12.3 percentage points in S2 Noisy Labels). Developers interested in understanding the challenges of training SLMs on noisy production data for tool use, or evaluating alternative training methodologies, will find this model and its associated benchmark valuable.
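The S2 corruption setup described above, in which half the traces carry a mistimed assistant tool call, can be sketched roughly as follows. The trace format and the exact corruption mechanics are assumptions for illustration; the benchmark's actual preprocessing may differ.

```python
import random


def corrupt_traces(traces, rate=0.5, seed=0):
    """Sketch of an S2-style 'tool timing' corruption: in a given
    fraction of traces, swap the first assistant tool-call turn one
    position earlier so it fires at the wrong point in the dialogue."""
    rng = random.Random(seed)
    corrupted = []
    for trace in traces:
        trace = list(trace)  # shallow copy so the input list is untouched
        if rng.random() < rate:
            for i, turn in enumerate(trace):
                if turn["role"] == "assistant" and "tool_call" in turn:
                    if i > 0:
                        trace[i - 1], trace[i] = trace[i], trace[i - 1]
                    break
        corrupted.append(trace)
    return corrupted
```

With `rate=0.5` roughly half the traces are corrupted, mirroring the S2 scenario; direct training consumes the result as-is, with no filtering or relabeling.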