reaperdoesntknow/Qwen3-1.7B-Distilled-30B-A3B

Hugging Face
TEXT GENERATIONConcurrency Cost:1Model Size:2BQuant:BF16Ctx Length:32kTool Calling:SupportedPublished:Mar 22, 2026License:apache-2.0Architecture:Transformer Open Weights Warm

reaperdoesntknow/Qwen3-1.7B-Distilled-30B-A3B is a 1.7 billion parameter Qwen3 causal language model, distilled from a 30B-parameter teacher using discrepancy-informed knowledge distillation. It is specifically optimized for STEM chain-of-thought reasoning, emphasizing proof structure and identifying reasoning pivots through token-level divergence dynamics. This model excels at mathematical derivations, physics problem-solving, and educational tutoring, offering lightweight reasoning deployment for complex STEM tasks.

Loading preview...

Model Overview

This model, Qwen3-1.7B-Distilled-30B-A3B, is a 1.7 billion parameter causal language model based on the Qwen3 architecture. It was developed by Convergent Intelligence LLC: Research Division and distilled from a Qwen3-30B-A3B-Instruct teacher model. The core innovation lies in its discrepancy-informed knowledge distillation (DISC v3) methodology, which focuses on transferring complex reasoning structures from a larger teacher to a smaller student model.

Key Distillation Innovations

Unlike standard knowledge distillation, this model employs three unique operators to enhance reasoning transfer:

  • Discrepancy-Weighted KD: Identifies and amplifies distillation weight for "reasoning pivot" tokens where the teacher and student diverge sharply, ensuring the student learns critical structural transitions.
  • DG-Limit Smoothing: Stabilizes training by smoothing high-entropy (unstable) student tokens, preventing noisy gradients in incoherent regions.
  • Gap Energy Monitoring: Tracks structural divergence independent of average loss, regularizing the model to prevent degradation of reasoning transitions even if overall loss improves.

Training Details

The model was trained on 6,122 STEM chain-of-thought samples from 10 domain-specific datasets (e.g., Physics, Linear Algebra, Engineering). It uses a proof-weighted cross-entropy objective, where derivation spans receive higher weight, decaying from 2.5x to 1.5x during training. The training context length is 1024 tokens.

Good for

  • Mathematical derivations and worked solutions
  • Proof-style explanations in STEM fields
  • Physics and engineering problem-solving
  • Educational tutoring and STEM walkthroughs
  • Lightweight reasoning deployment where larger models are too expensive
  • Generator components in verifier-generator or retrieval-augmented reasoning systems