reaperdoesntknow/Qwen3-1.7B-Thinking-Distil

Text generation · Model size: 2B · Quant: BF16 · Ctx length: 32k · Published: Mar 27, 2026 · License: apache-2.0 · Architecture: Transformer · Open weights

Qwen3-1.7B-Thinking-Distil is a 1.7 billion parameter Qwen3ForCausalLM model developed by Convergent Intelligence LLC: Research Division. It is a distilled version of the Qwen3-30B-A3B-Thinking teacher model, fine-tuned specifically to capture extended deliberation and reasoning patterns. With a context length of 40,960 tokens, the model generates long-form reasoning chains and internal monologues before arriving at a conclusion, making it suited to complex problem-solving tasks that require deep, multi-step reasoning.
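
As a minimal sketch, the model can be loaded with the standard transformers API; the prompt and generation settings below are illustrative assumptions, not recommended values:

```python
# Minimal inference sketch using Hugging Face transformers.
# Generation settings are illustrative, not tuned recommendations.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "reaperdoesntknow/Qwen3-1.7B-Thinking-Distil"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)

messages = [{"role": "user", "content": "How many primes are there below 100?"}]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

# A generous max_new_tokens leaves room for the extended reasoning
# chain the model emits before its final answer.
outputs = model.generate(inputs, max_new_tokens=4096)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```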


Overview

Qwen3-1.7B-Thinking-Distil is a 1.7 billion parameter model from Convergent Intelligence LLC: Research Division that distills the extended reasoning capabilities of the larger Qwen3-30B-A3B-Thinking teacher into a smaller, more efficient student. It captures the teacher's deliberative patterns, including reasoning through uncertainty, backtracking, and re-evaluation.

Key Capabilities

  • Extended Reasoning: Specializes in generating long-form reasoning chains and internal monologues, mimicking the thought process of a larger model (see the parsing sketch after this list).
  • Deliberative Depth: Captures the nuanced signal of a "Thinking" teacher, focusing on how a model approaches, reconsiders, and resolves complex problems.
  • Efficient Distillation: Achieves these advanced reasoning capabilities in a 1.7B parameter model, making it highly efficient for deployment.
  • Long Context: Supports a maximum context length of 40,960 tokens, allowing for long inputs and extended reasoning outputs.
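
Assuming the checkpoint follows the Qwen3 convention of wrapping its deliberation in <think>...</think> tags (true of the teacher family; verify against real output from this model), the reasoning trace can be separated from the final answer with a small helper like the sketch below:

```python
# Sketch: split a Qwen3-style completion into its reasoning trace and final answer.
# Assumes the model emits <think>...</think> like its teacher; verify on real output.
def split_thinking(text: str) -> tuple[str, str]:
    open_tag, close_tag = "<think>", "</think>"
    if close_tag not in text:
        # No thinking block found; treat the whole completion as the answer.
        return "", text.strip()
    thinking, _, answer = text.partition(close_tag)
    thinking = thinking.replace(open_tag, "", 1).strip()
    return thinking, answer.strip()

thinking, answer = split_thinking(
    "<think>Count primes: 2, 3, 5, ... there are 25.</think>There are 25 primes below 100."
)
assert answer == "There are 25 primes below 100."
```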

Training Details

The model was trained via Supervised Fine-Tuning (SFT) on the longwriter-6k dataset, which consists of long-form generation samples that preserve extended reasoning chains. Direct SFT was chosen over logit-level Knowledge Distillation (KD) because it more effectively transfers the structural signal of the teacher's reasoning process.
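
As a rough sketch of what such an SFT run could look like with TRL's SFTTrainer; the dataset path, base checkpoint, and hyperparameters below are assumptions for illustration, not the card's actual configuration:

```python
# Hedged sketch of the SFT recipe described above, using TRL.
# Dataset path, base model, and hyperparameters are assumptions.
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

# "THUDM/LongWriter-6k" is an assumed path for the "longwriter-6k" dataset.
dataset = load_dataset("THUDM/LongWriter-6k", split="train")

trainer = SFTTrainer(
    model="Qwen/Qwen3-1.7B",  # assumed student base checkpoint
    train_dataset=dataset,
    args=SFTConfig(
        output_dir="qwen3-1.7b-thinking-distil-sft",
        num_train_epochs=1,    # illustrative, not the actual setting
        learning_rate=2e-5,    # illustrative
    ),
)
trainer.train()
```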

Good For

  • Applications requiring models to "think aloud" or show their reasoning steps.
  • Tasks benefiting from extended deliberation before providing a final answer.
  • Scenarios where a smaller model with advanced reasoning capabilities is preferred over larger, more resource-intensive alternatives.
  • Generating detailed explanations, problem-solving narratives, or complex analytical responses.