# What is reaperdoesntknow/Qwen3-1.7B-Distilled-30B-A3B?
This is a 1.7-billion-parameter causal language model in the Qwen3 family, developed by Convergent Intelligence LLC: Research Division. It is a distilled version of the larger Qwen3-30B-A3B-Instruct teacher model, trained on 6,122 STEM chain-of-thought samples.
## Key Differentiators & Training Methodology
Unlike standard knowledge distillation, this model was trained with a discrepancy-informed knowledge distillation (DISC) approach. This methodology focuses on the internal structure of reasoning:
- Discrepancy-Weighted KD: Identifying and amplifying learning on "reasoning pivot tokens" where the derivation changes technique or introduces key concepts, rather than treating all tokens uniformly.
- DG-Limit Smoothing: Stabilizing training by smoothing high-entropy (unstable) student tokens to ensure more coherent local representations.
- Gap Energy Monitoring: Tracking structural divergence to prevent degradation of reasoning transitions, even if average loss improves.
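The discrepancy-weighted idea above can be sketched as a per-token KL objective whose weights are larger at pivot tokens. This is a minimal illustration, not the released training code; the `pivot_mask` input and the `pivot_weight=2.0` value are assumptions for the example.

```python
import math

def kl_div(p, q):
    """KL(p || q) between two discrete distributions given as probability lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def discrepancy_weighted_kd(teacher_probs, student_probs, pivot_mask, pivot_weight=2.0):
    """Weighted average of per-token teacher/student KL.

    Tokens flagged in pivot_mask (hypothetical "reasoning pivot" detector output)
    contribute with a larger weight than ordinary tokens.
    """
    weights = [pivot_weight if is_pivot else 1.0 for is_pivot in pivot_mask]
    per_token = [w * kl_div(t, s)
                 for w, t, s in zip(weights, teacher_probs, student_probs)]
    return sum(per_token) / sum(weights)
```

The effect is that a mismatch on a pivot token moves the loss more than the same mismatch elsewhere, steering gradient updates toward the transitions the teacher's derivation hinges on.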
Additionally, it uses proof-weighted cross-entropy, giving higher importance to tokens within the derivation span (from `Proof:` to `Final Answer:`), with the emphasis decaying from 2.5x to 1.5x over the course of training. This ensures the model prioritizes derivation quality over mere answer formatting.
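A minimal sketch of that weighting, assuming a linear decay schedule (the card only states the 2.5x-to-1.5x endpoints, not the schedule shape) and a precomputed boolean mask marking tokens inside the derivation span:

```python
def proof_weight(step, total_steps, start=2.5, end=1.5):
    """Decay the derivation-span emphasis from `start` to `end` (assumed linear)."""
    frac = min(step / max(total_steps, 1), 1.0)
    return start + (end - start) * frac

def proof_weighted_ce(token_nll, in_proof_span, step, total_steps):
    """Cross-entropy with extra weight on tokens between Proof: and Final Answer:.

    token_nll: per-token negative log-likelihoods from the student model.
    in_proof_span: True for tokens inside the derivation span.
    """
    w = proof_weight(step, total_steps)
    weights = [w if inside else 1.0 for inside in in_proof_span]
    return sum(wi * nll for wi, nll in zip(weights, token_nll)) / sum(weights)
```

Early in training a wrong derivation token costs 2.5x as much as a formatting token; by the end that ratio relaxes to 1.5x, so the model never learns to trade derivation quality for surface formatting.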
## Model Details
- Architecture: Qwen3 causal language model
- Parameters: ~2.031 billion
- Base Model: Qwen/Qwen3-1.7B
- Teacher Model: Qwen/Qwen3-30B-A3B-Instruct-2507
- Training Context Length: 1024 tokens
## Intended Uses
This model is particularly well-suited for:
- Mathematical derivations and worked solutions
- Proof-style explanations in STEM fields
- Physics and engineering problem-solving
- Educational tutoring and STEM walkthroughs
- Lightweight reasoning deployment where larger models are too expensive
- Generator components in verifier-generator or retrieval-augmented reasoning systems