Qwen3-0.6B-Distilled-30B-A3B-Thinking-SFT: Ultra-Lightweight Reasoning
This model, developed by Convergent Intelligence LLC, is a 0.6 billion parameter Qwen3-based language model engineered for efficient reasoning in specialized domains. It achieves a 50x compression from its 30B-parameter teacher, resulting in a model under 500MB (quantized) that can run on mobile devices.
Its core innovation is a two-stage training pipeline: knowledge distillation from a reasoning teacher, followed by supervised fine-tuning for the legal domain.
Key Capabilities
- Structured Reasoning Backbone: Stage 1 distilled knowledge from a 30B-parameter "Thinking" teacher model (Qwen/Qwen3-30B-A3B-Thinking-2507) using 6,122 STEM chain-of-thought samples. Training combined a proof-weighted cross-entropy loss with KL divergence at temperature T=2.0, transferring the teacher's reasoning structure by emphasizing derivation steps over final answers.
- Legal Domain Specialization: Stage 2 applied supervised fine-tuning on the Alignment-Lab-AI/Lawyer-Instruct dataset. This builds on the STEM-derived reasoning backbone, whose step-by-step derivation structure is treated as analogous to legal analysis, enabling efficient adaptation to legal instruction-following.
- Extreme Compression: The 50x reduction from the teacher preserves structured reasoning capability while fitting a footprint suitable for resource-constrained environments.
- Dual Prompt Formats: Supports both a "Proof:" format for STEM derivations and an "### Instruction: / ### Response:" format for general instruction-following.
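The Stage 1 objective described above (proof-weighted cross-entropy plus KL divergence at T=2.0) can be sketched per token as follows. This is a minimal illustration, not the released training code: the mixing coefficient `alpha`, the `proof_weight` value, and the T² scaling convention are assumptions; only the loss components and the T=2.0 temperature come from the description above.

```python
import math

def softmax(logits, temperature=1.0):
    # Temperature-scaled softmax over a list of raw logits.
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distill_loss(student_logits, teacher_logits, target_idx,
                 is_proof_step, temperature=2.0, alpha=0.5, proof_weight=2.0):
    """Per-token distillation loss: KL(teacher || student) at temperature T,
    plus hard-target cross-entropy up-weighted on derivation steps.
    alpha and proof_weight are illustrative values, not from the model card."""
    p_t = softmax(teacher_logits, temperature)
    p_s = softmax(student_logits, temperature)
    # KL divergence between temperature-softened distributions,
    # scaled by T^2 as in standard knowledge distillation.
    kl = sum(pt * math.log(pt / ps) for pt, ps in zip(p_t, p_s)) * temperature ** 2
    # Hard-target cross-entropy at T=1; derivation ("Proof:") tokens
    # are weighted more heavily than final-answer tokens.
    ce = -math.log(softmax(student_logits, 1.0)[target_idx])
    w = proof_weight if is_proof_step else 1.0
    return alpha * kl + (1 - alpha) * w * ce
```

Up-weighting derivation tokens pushes the student to reproduce the teacher's intermediate reasoning rather than just its answers.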
Good For
- Ultra-lightweight reasoning on mobile, edge, and IoT devices.
- Legal and STEM instruction-following tasks requiring structured derivation.
- Educational tutoring and embedded inference applications.
- Serving as a component in multi-model pipelines where a compact, reasoning-capable model is needed.
- Use cases requiring under 500MB model footprint.
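For the instruction-following and derivation use cases above, prompts would be assembled in one of the two supported formats. A minimal sketch; the exact whitespace and newline conventions are assumptions, not taken from the released tokenizer configuration:

```python
def build_prompt(text, mode="instruct"):
    # Two formats per the model card: "Proof:" for STEM derivations and
    # "### Instruction: / ### Response:" for general instruction-following.
    # Exact spacing/newlines here are assumed, not confirmed.
    if mode == "proof":
        return f"Proof: {text}"
    return f"### Instruction:\n{text}\n\n### Response:\n"
```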
Limitations
Due to its 0.6B parameter size, the model has capacity constraints and may make reasoning errors that larger models would not. It is not intended for formal proof verification, actual legal counsel, safety-critical analysis, or complex multi-step proofs beyond ~8 steps. Its context length is limited to 1024 tokens.