Minsang/TSD-KD_Qwen2.5-1.5B

Hugging Face
TEXT GENERATIONConcurrency Cost:1Model Size:1.5BQuant:BF16Ctx Length:32kPublished:May 8, 2026License:apache-2.0Architecture:Transformer0.0K Open Weights Warm

Minsang/TSD-KD_Qwen2.5-1.5B is a 1.5 billion parameter causal language model developed by Minsang, based on the Qwen2.5 architecture. This model is specifically designed for improved reasoning capabilities through Token-Selective Dual Knowledge Distillation. It is optimized for tasks requiring enhanced reasoning, as detailed in the ICLR 2026 paper "Explain in Your Own Words: Improving Reasoning via Token-Selective Dual Knowledge Distillation", and supports a context length of 32768 tokens.

Loading preview...

Overview

Minsang/TSD-KD_Qwen2.5-1.5B is a 1.5 billion parameter language model built upon the Qwen2.5 architecture. It was developed by Minsang and introduced in the ICLR 2026 paper, "Explain in Your Own Words: Improving Reasoning via Token-Selective Dual Knowledge Distillation". The model's core innovation lies in its application of Token-Selective Dual Knowledge Distillation, a technique aimed at enhancing its reasoning abilities.

Key Capabilities

  • Enhanced Reasoning: The model is specifically trained and optimized to improve reasoning performance through a novel knowledge distillation method.
  • Qwen2.5 Base: Leverages the robust architecture of Qwen2.5, providing a strong foundation for language understanding and generation.
  • Long Context Window: Supports a substantial context length of 32768 tokens, enabling it to process and understand longer inputs.

Good For

  • Research in Reasoning: Ideal for researchers exploring advanced reasoning techniques in language models, particularly those interested in knowledge distillation methods.
  • Applications Requiring Improved Logic: Suitable for use cases where enhanced logical inference and explanation generation are critical.
  • Academic and Experimental Projects: A strong candidate for projects focusing on the practical application of the "Token-Selective Dual Knowledge Distillation" methodology. Further details can be found in the associated research paper.