reaperdoesntknow/Qwen3-1.7B-Thinking-Distil
Qwen3-1.7B-Thinking-Distil is a 1.7 billion parameter Qwen3ForCausalLM model developed by Convergent Intelligence LLC: Research Division. It is a distilled version of the Qwen3-30B-A3B-Thinking teacher model, specifically fine-tuned to capture extended deliberation patterns and long-form reasoning chains. With a context length of 40,960 tokens, this model excels at tasks requiring deep, step-by-step reasoning and internal monologue before arriving at a conclusion.
Loading preview...
Overview
Qwen3-1.7B-Thinking-Distil is a 1.7 billion parameter Qwen3ForCausalLM model from Convergent Intelligence LLC: Research Division, designed to emulate the extended reasoning capabilities of its larger 30 billion parameter teacher, Qwen3-30B-A3B-Thinking. This model was created through supervised fine-tuning (SFT) on the longwriter-6k dataset, which specifically contains long-form generation samples preserving detailed reasoning chains. Unlike other distillation methods, this approach focuses on transferring the teacher's deliberative depth and internal monologue patterns rather than just final token probabilities.
Key Capabilities
- Extended Reasoning: Captures and reproduces the teacher model's ability to generate long-form reasoning chains, including re-evaluation and backtracking.
- Deliberative Depth: Excels at tasks requiring a step-by-step thought process before committing to an answer.
- Efficient Size: Compresses complex reasoning capabilities into a compact 1.7B parameter model, making it more efficient for deployment.
- High Context Length: Supports a maximum context length of 40,960 tokens, allowing for extensive input and output.
Good For
- Complex Problem Solving: Ideal for applications where the model needs to "think aloud" or show its work to arrive at a solution.
- Analytical Tasks: Suitable for scenarios requiring detailed explanations, logical deductions, and structured thought processes.
- Knowledge Distillation Research: Demonstrates an effective method for transferring sophisticated reasoning patterns from large teachers to smaller student models via SFT on specialized datasets.
This model is part of the broader DistilQwen collection, which explores various distillation targets (Instruct, Thinking, Coder) and methodologies, with this variant specifically prioritizing extended reasoning through direct SFT.