jukofyork/Kimi-K2-Instruct-DRAFT-0.6B-v3.0
jukofyork/Kimi-K2-Instruct-DRAFT-0.6B-v3.0 is a 0.6 billion parameter draft model for speculative decoding, derived from Qwen2.5-0.5B-Instruct and designed specifically for use alongside Kimi-K2-Instruct. It supports a default context length of 32,768 tokens, configurable up to 128k tokens with YaRN scaling, which makes it suitable for applications that require very long contexts.
Model Overview
jukofyork/Kimi-K2-Instruct-DRAFT-0.6B-v3.0 is a 0.6 billion parameter draft model built upon the Qwen2.5-0.5B-Instruct architecture. Its primary purpose is to serve as a draft model for speculative decoding with the Kimi-K2-Instruct series. The model was created by transplanting Kimi-K2-Instruct's vocabulary onto Qwen2.5-0.5B-Instruct so that its tokenizer aligns with Kimi-K2-Instruct's, followed by fine-tuning.
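As an illustration of the intended pairing (not taken from the model card itself), the snippet below sketches how a draft model like this is typically used with Hugging Face transformers' assisted generation. The `moonshotai/Kimi-K2-Instruct` ID, dtype, and generation settings are assumptions, and the full Kimi-K2-Instruct is far too large to load this way on a single GPU, so treat it purely as an API-shape example; real deployments usually use a multi-GPU or llama.cpp/GGUF setup.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Target model (illustrative only: the real Kimi-K2-Instruct is a ~1T-parameter
# MoE, so an actual deployment would not load it like this).
target = AutoModelForCausalLM.from_pretrained(
    "moonshotai/Kimi-K2-Instruct", torch_dtype="auto", trust_remote_code=True
)
# Draft model: after the vocabulary transplant it shares Kimi-K2's tokenizer,
# which is what makes assisted generation possible.
draft = AutoModelForCausalLM.from_pretrained(
    "jukofyork/Kimi-K2-Instruct-DRAFT-0.6B-v3.0", torch_dtype="auto"
)
tokenizer = AutoTokenizer.from_pretrained(
    "moonshotai/Kimi-K2-Instruct", trust_remote_code=True
)

inputs = tokenizer("Explain speculative decoding in one paragraph.", return_tensors="pt")
# assistant_model turns on assisted (speculative) generation: the draft proposes
# tokens cheaply and the target verifies them in fewer, larger forward passes.
output = target.generate(**inputs, assistant_model=draft, max_new_tokens=128)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```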
Key Capabilities
- Speculative Decoding: Designed as a draft model to accelerate inference when paired with a larger Kimi-K2-Instruct model.
- Extended Context Length: Supports a default context window of 32,768 tokens, which can be extended to 65,536 or 131,072 tokens by modifying the `config.json` with YaRN scaling parameters (a sketch of this edit follows this list). This makes it suitable for processing very long documents or conversations.
- Training Data: Fine-tuned on approximately 2.3 billion tokens from diverse datasets including `agentlans/common-crawl-sample`, `bigcode/the-stack-smol-xl`, and `rombodawg/Everything_Instruct`.
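The card does not spell out the exact YaRN fields, so the following is a hedged sketch of the kind of `config.json` edit used for Qwen2.5-derived models. The field names follow the Qwen2.5 convention; the local path and the choice to also raise `max_position_embeddings` are assumptions, not instructions from the author.

```python
import json

# Hypothetical local path to a downloaded copy of the draft model.
config_path = "Kimi-K2-Instruct-DRAFT-0.6B-v3.0/config.json"

with open(config_path) as f:
    config = json.load(f)

# YaRN scaling in the Qwen2.5 style: factor 2.0 stretches the 32,768-token
# base window to roughly 65,536 tokens, factor 4.0 to roughly 131,072.
config["rope_scaling"] = {
    "type": "yarn",
    "factor": 4.0,
    "original_max_position_embeddings": 32768,
}
# Some runtimes also read this hard cap, so raise it to match.
config["max_position_embeddings"] = 131072

with open(config_path, "w") as f:
    json.dump(config, f, indent=2)
```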
How It Was Created
- Vocabulary Transplant: The initial model was created from Qwen2.5-0.5B-Instruct using `transplant-vocab` to align its tokenizer with Kimi-K2-Instruct, handling non-standard token overrides (a conceptual sketch follows this list).
- Fine-tuning: Trained for one epoch using `qlora-pipe-lite` with a batch size of 60 and a sequence length of 32,768 tokens, utilizing six RTX A6000 GPUs.
- GGUF Conversion: Includes specific modifications to `convert_hf_to_gguf.py` to address TikToken/SentencePiece tokenizer mismatches for `llama.cpp` compatibility, enabling GGUF quantization.
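`transplant-vocab` is the author's own tool and its exact behaviour is not described here; the sketch below is only one common way such transplants are approached, initialising each token of the target (Kimi-K2) vocabulary from the mean of the donor (Qwen2.5) embeddings of its sub-tokens. The model IDs, the fallback rule, and the tied-embedding assumption are all illustrative, not a description of the actual tool.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

donor_tok = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")
target_tok = AutoTokenizer.from_pretrained(
    "moonshotai/Kimi-K2-Instruct", trust_remote_code=True
)
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")

donor_emb = model.get_input_embeddings().weight.data
new_emb = torch.zeros(len(target_tok), donor_emb.shape[1])

# Initialise each target-vocabulary row from the mean of the donor embeddings
# of the sub-tokens the donor tokenizer produces for that token's text.
for token_id in range(len(target_tok)):
    text = target_tok.decode([token_id])
    donor_ids = donor_tok.encode(text, add_special_tokens=False)
    if donor_ids:
        new_emb[token_id] = donor_emb[donor_ids].mean(dim=0)
    else:
        # Tokens with no donor equivalent (e.g. special tokens) fall back to
        # the overall mean of the donor embedding matrix.
        new_emb[token_id] = donor_emb.mean(dim=0)

# Swap in the new vocabulary; Qwen2.5-0.5B ties its input and output
# embeddings, so resizing and copying the input matrix covers both.
model.resize_token_embeddings(len(target_tok))
model.get_input_embeddings().weight.data.copy_(new_emb)
```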