reaperdoesntknow/TopologicalQwen
Text Generation · Model size: 2B · Quant: BF16 · Context length: 32k · Published: Mar 28, 2026 · License: apache-2.0 · Architecture: Transformer · Open weights

TopologicalQwen is a 1.7 billion parameter Qwen3ForCausalLM model developed by Convergent Intelligence LLC: Research Division. It is distilled from Qwen3-30B-A3B using Topological Knowledge Distillation (TKD), a novel methodology that captures structural information beyond standard KL divergence. This model excels at complex reasoning tasks, particularly in physics and mathematics, by learning a cognitive loop of derivation, self-critique, and synthesis.


TopologicalQwen: Topology-Aware Knowledge Distillation

TopologicalQwen is a 1.7 billion parameter model based on the Qwen3ForCausalLM architecture, developed by Convergent Intelligence LLC: Research Division. It stands out due to its unique Topological Knowledge Distillation (TKD) methodology, which goes beyond traditional distillation by decomposing knowledge transfer into three channels: smooth distillation, jump corrections, and drift corrections. This allows the model to preserve the teacher's structural understanding, including topic shifts and reasoning mode transitions, which are often blurred by standard KD methods.
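The card does not publish TKD's actual objective, but the three-channel decomposition it describes can be sketched as a toy loss. The sketch below is purely illustrative, under our own assumptions: the function name, the jump-detection threshold, and the channel weights are all hypothetical, and the "jump"/"drift" signals are approximated from step-to-step KL between consecutive teacher distributions.

```python
import numpy as np

def kl(p, q, eps=1e-9):
    """Per-position KL divergence D(p || q) over the vocab axis."""
    p = np.clip(p, eps, 1.0)
    q = np.clip(q, eps, 1.0)
    return np.sum(p * np.log(p / q), axis=-1)

def tkd_loss(teacher, student, jump_thresh=0.5, jump_w=2.0, drift_w=0.5):
    """Toy three-channel distillation objective (hypothetical sketch).

    teacher, student: [seq_len, vocab] probability distributions.
    - smooth channel: ordinary per-token KL(teacher || student)
    - jump channel:  tokens where the teacher's distribution shifts
      abruptly between steps (topic / reasoning-mode transitions)
      receive extra weight
    - drift channel: penalizes mismatch in the slow step-to-step
      change of the two distributions elsewhere
    """
    smooth = kl(teacher, student)                        # [seq_len]
    t_step = kl(teacher[1:], teacher[:-1])               # teacher step change
    s_step = kl(student[1:], student[:-1])               # student step change
    jumps = t_step > jump_thresh                         # abrupt shifts
    jump_term = np.sum(smooth[1:][jumps])                # re-weight jump tokens
    drift_term = np.sum((t_step - s_step)[~jumps] ** 2)  # match slow drift
    return smooth.mean() + jump_w * jump_term + drift_w * drift_term
```

When teacher and student agree exactly, all three channels vanish; as the student's distribution diverges at jump positions, the jump channel dominates, which is the intuition the card attributes to DISC-style detection.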

Key Capabilities & Features

  • Topology-Aware Distillation: Utilizes Discrepancy Calculus (DISC) to detect and preserve structural features (jumps, drifts) in the teacher's output distribution, leading to superior reasoning quality at a smaller scale.
  • DualMind Format: Trained to generate responses in a structured <explore> (derivation), <examine> (self-critique), and <response> (clean answer) format, mimicking a cognitive loop for enhanced problem-solving.
  • Physics CoT Training: Fine-tuned on specialized Chain-of-Thought datasets covering differential equations, theoretical mechanics, electromagnetism, and general relativity.
  • Efficient Architecture: Features a Qwen3ForCausalLM base with 2.03B parameters (1.7B effective), 40,960 tokens context length, and Grouped Query Attention (GQA).
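The DualMind tags named above come directly from the card; a small parser makes the format concrete. The helper name and the sample completion below are ours, not part of the model's documented tooling.

```python
import re

DUALMIND_TAGS = ("explore", "examine", "response")

def parse_dualmind(text):
    """Split a DualMind-formatted generation into its three stages.

    Returns a dict mapping tag -> content; a missing stage maps to None.
    """
    out = {}
    for tag in DUALMIND_TAGS:
        m = re.search(rf"<{tag}>(.*?)</{tag}>", text, re.DOTALL)
        out[tag] = m.group(1).strip() if m else None
    return out

# Hypothetical completion in the documented format:
sample = (
    "<explore>Apply F = ma to the block on the incline...</explore>"
    "<examine>Check the limiting case theta -> 0: a -> 0, consistent.</examine>"
    "<response>a = g sin(theta)</response>"
)
print(parse_dualmind(sample)["response"])  # a = g sin(theta)
```

Extracting only the `<response>` block is useful in deployment, where the derivation and self-critique stages are internal scratchpad rather than user-facing output.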

What Makes This Different

Unlike other distillation methods that treat the teacher's output as a smooth function, TKD explicitly accounts for discontinuities and structural shifts in the knowledge manifold. This enables TopologicalQwen to achieve reasoning quality typically associated with much larger models, even at 1.7B parameters. It represents the result of applying the proven TKD methodology with premium compute (Colab H100, BF16 precision) to a 30B-parameter teacher, demonstrating that structure can indeed beat scale.

Good For

  • Complex Reasoning Tasks: Particularly in scientific and mathematical domains requiring structured derivation and self-correction.
  • Small-Scale Deployment: Offers advanced reasoning capabilities in a compact 1.7B parameter model, suitable for resource-constrained environments.
  • Research & Development: Ideal for exploring advanced knowledge distillation techniques and structured cognitive architectures in LLMs.
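For deployment and experimentation, the checkpoint should load like any Qwen3ForCausalLM model via `transformers`. A minimal sketch, assuming the standard chat-template workflow applies to this repo (the prompt and generation settings are our own and not from the card):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "reaperdoesntknow/TopologicalQwen"
tokenizer = AutoTokenizer.from_pretrained(repo)
model = AutoModelForCausalLM.from_pretrained(
    repo, torch_dtype=torch.bfloat16, device_map="auto"
)

messages = [{"role": "user", "content": "Derive the period of a simple pendulum."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
outputs = model.generate(inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```

Given the DualMind training format, completions may arrive wrapped in `<explore>`/`<examine>`/`<response>` tags, so downstream code should be prepared to strip or route those stages.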