jukofyork/Kimi-K2-Instruct-DRAFT-0.6B-v3.0

Text Generation · Concurrency Cost: 1 · Model Size: 0.5B · Quant: BF16 · Ctx Length: 32k · Published: Aug 10, 2025 · License: apache-2.0 · Architecture: Transformer · Open Weights

jukofyork/Kimi-K2-Instruct-DRAFT-0.6B-v3.0 is a 0.6-billion-parameter draft model for speculative decoding, derived from Qwen2.5-0.5B-Instruct. It is designed specifically to be paired with the Kimi-K2-Instruct model. It supports a default context length of 32,768 tokens, extensible to 128k tokens via YaRN scaling, making it suitable for applications that process very long contexts.


Model Overview

jukofyork/Kimi-K2-Instruct-DRAFT-0.6B-v3.0 is a 0.6 billion parameter draft model, built upon the Qwen2.5-0.5B-Instruct architecture. Its primary purpose is to serve as a speculative decoding model for the Kimi-K2-Instruct series. The model was created by transplanting the vocabulary from Qwen2.5-0.5B-Instruct to align with Kimi-K2-Instruct's tokenizer, followed by fine-tuning.

Key Capabilities

  • Speculative Decoding: Designed as a draft model to accelerate inference when paired with a larger Kimi-K2-Instruct model.
  • Extended Context Length: Supports a default context window of 32,768 tokens, which can be extended to 65,536 or 131,072 tokens by modifying the config.json with YaRN scaling parameters. This makes it suitable for processing very long documents or conversations.
  • Training Data: Fine-tuned on approximately 2.3 billion tokens from diverse datasets including agentlans/common-crawl-sample, bigcode/the-stack-smol-xl, and rombodawg/Everything_Instruct.
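The YaRN extension mentioned above can be sketched as a small edit to the model's `config.json`. The snippet below follows the `rope_scaling` key layout documented for Qwen2.5-family models (`type`, `factor`, `original_max_position_embeddings`); verify the exact keys against this model's own README before applying, as they are an assumption here, not taken from this card.

```python
import json

BASE_CTX = 32768  # the model's default max_position_embeddings


def apply_yarn(config: dict, target_ctx: int) -> dict:
    """Return a copy of config extended to target_ctx with YaRN rope scaling.

    The scaling factor is target_ctx / BASE_CTX, e.g. 4.0 for 131,072 tokens.
    Key names follow the Qwen2.5 convention and may differ for other models.
    """
    cfg = dict(config)
    cfg["max_position_embeddings"] = target_ctx
    cfg["rope_scaling"] = {
        "type": "yarn",
        "factor": target_ctx / BASE_CTX,
        "original_max_position_embeddings": BASE_CTX,
    }
    return cfg


# Example: extend a minimal config to the 131,072-token setting.
config = {"max_position_embeddings": 32768}
extended = apply_yarn(config, 131072)
print(json.dumps(extended["rope_scaling"], indent=2))
```

For the 65,536-token setting, pass `65536` instead, which yields a factor of 2.0. In practice you would read the real `config.json` from the model directory, apply the change, and write it back.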

How it was created

  1. Vocabulary Transplant: The initial model was created from Qwen2.5-0.5B-Instruct using transplant-vocab to align its tokenizer with Kimi-K2-Instruct, handling non-standard token overrides.
  2. Fine-tuning: Trained for one epoch using qlora-pipe-lite with a batch size of 60 and a sequence length of 32,768 tokens, utilizing six RTX A6000 GPUs.
  3. GGUF Conversion: The release modifies llama.cpp's convert_hf_to_gguf.py to work around the TikToken / SentencePiece tokenizer mismatch, enabling GGUF quantization and llama.cpp compatibility.
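Once both models are in GGUF form, the draft model is attached at serving time. The commands below are an illustrative sketch using llama.cpp's standard conversion script and the `--model-draft` (`-md`) server flag; the file names are placeholders, and the patched conversion script from this release would be used in place of the stock one.

```shell
# Convert the draft model to GGUF (using the release's patched script).
python convert_hf_to_gguf.py ./Kimi-K2-Instruct-DRAFT-0.6B-v3.0 \
    --outfile kimi-k2-draft-0.6b-f16.gguf --outtype f16

# Serve Kimi-K2-Instruct with the draft model for speculative decoding.
llama-server \
    -m  kimi-k2-instruct-q4_k_m.gguf \
    -md kimi-k2-draft-0.6b-f16.gguf \
    -c  32768
```

With a matching tokenizer (which is what the vocabulary transplant provides), the draft model proposes several tokens per step that the large model verifies in one pass, trading a small amount of extra compute for lower latency.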