reaperdoesntknow/Qwen3-0.6B-Distilled-30B-A3B
reaperdoesntknow/Qwen3-0.6B-Distilled-30B-A3B is a 0.6 billion parameter Qwen3-based causal language model developed by Convergent Intelligence LLC. This model is uniquely distilled from a Qwen3-30B-A3B-Thinking teacher, emphasizing the transfer of rich, extended STEM reasoning traces through a proof-weighted loss function. Optimized for lightweight STEM reasoning, it excels at generating structured derivations for mathematical and scientific problems, making it suitable for edge devices and educational applications.
Qwen3-0.6B STEM Proof Distilled (Thinking Teacher)
This 0.6 billion parameter model, developed by Convergent Intelligence LLC, is a highly compressed distillation from a 30 billion parameter Qwen3-A3B-Thinking teacher. It achieves a 50x parameter compression while retaining significant STEM reasoning capabilities, designed to produce structured derivations for complex problems.
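A minimal usage sketch with Hugging Face `transformers` is shown below. The prompt and generation settings are illustrative only, and the chat-template call assumes the repository ships a standard Qwen3 tokenizer configuration.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "reaperdoesntknow/Qwen3-0.6B-Distilled-30B-A3B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# Illustrative STEM prompt; the chat template comes from the base Qwen3 tokenizer.
messages = [{"role": "user", "content": "Prove that the sum of two even integers is even."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
)
output_ids = model.generate(input_ids, max_new_tokens=512)
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```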
Key Differentiators
- Thinking Teacher Distillation: Unlike standard distillation from Instruct models, this model learns from a "Thinking" variant teacher (Qwen3-30B-A3B-Thinking). This teacher generates extended internal reasoning paths with higher-entropy softmax distributions, allowing the student to learn a richer landscape of derivation strategies, not just final answers.
- Proof-Weighted Loss: During training, tokens within the derivation region (from `Proof:` to `Final Answer:`) receive an amplified loss weight (2.5x, decaying to 1.5x). This mechanism forces the model to prioritize and allocate its limited parameters to reasoning capability rather than mere boilerplate reproduction (a minimal sketch follows this list).
- Mathematical Foundations: The distillation process is informed by Discrepancy Calculus, a measure-theoretic framework that quantifies local structural mismatches, ensuring a deeper transfer of reasoning structure.
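The following is a minimal sketch of how such a proof-weighted cross-entropy could be computed; it is not the authors' training code, and the linear decay schedule and explicit span indices (`proof_start`, `proof_end`) are assumptions based on the description above.

```python
import torch
import torch.nn.functional as F

def proof_weighted_ce(logits, labels, proof_start, proof_end,
                      w_start=2.5, w_end=1.5):
    """logits: (seq_len, vocab), labels: (seq_len,); indices mark the derivation span."""
    per_token = F.cross_entropy(logits, labels, reduction="none")  # (seq_len,)
    weights = torch.ones_like(per_token)
    # Assumed schedule: linearly decay the weight from w_start to w_end across the proof span.
    weights[proof_start:proof_end] = torch.linspace(
        w_start, w_end, proof_end - proof_start
    )
    return (weights * per_token).sum() / weights.sum()
```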
Training Details
The model was trained on 6,122 STEM Chain-of-Thought samples across 12 domains, using a combined loss of 55% proof-weighted cross-entropy and 45% knowledge-distillation KL divergence at temperature T=2.0. The training context length was 1024 tokens.
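A sketch of that combined objective is below, where `proof_ce` would be the proof-weighted cross-entropy from the earlier sketch. The T² scaling on the KL term follows the standard Hinton-style distillation formulation and is an assumption, not a confirmed detail of the training run.

```python
import torch.nn.functional as F

def combined_kd_loss(student_logits, teacher_logits, proof_ce, T=2.0,
                     ce_weight=0.55, kd_weight=0.45):
    """Combine proof-weighted CE (55%) with temperature-softened KD KL (45%)."""
    kd = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)  # assumed standard temperature scaling to keep KD gradients comparable
    return ce_weight * proof_ce + kd_weight * kd
```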
Good For
- Lightweight STEM reasoning on edge/mobile devices
- Educational tutoring and proof drafting
- Component in multi-model pipelines requiring a small, fast reasoner
- IoT and embedded inference applications