jukofyork/GLM-4.5-DRAFT-0.6B-v3.0 is a 0.5-billion-parameter draft model derived from Qwen2.5-0.5B-Instruct, designed specifically for speculative decoding with the larger GLM-4.5 series models. It supports an extended context length of up to 131,072 tokens via YaRN scaling, and is intended to serve as an efficient draft generator for GLM-4.5, GLM-4.5-Air, and GLM-4-32B-0414, improving inference speed in long-context applications.
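
A minimal sketch of how a draft model like this might be paired with a larger GLM-4.5 target for speculative decoding, assuming a vLLM-style offline API. The target repo name and the exact parameter names (`speculative_config`, `num_speculative_tokens`) are assumptions; they vary between serving stacks and vLLM versions, so check the documentation for the version you run.

```python
# Sketch: speculative decoding with a separate draft model via vLLM.
# Assumes a recent vLLM release that accepts a speculative_config dict;
# older releases expose similar options under different keyword names.
from vllm import LLM, SamplingParams

llm = LLM(
    model="zai-org/GLM-4.5-Air",  # target model (illustrative repo name)
    speculative_config={
        "model": "jukofyork/GLM-4.5-DRAFT-0.6B-v3.0",  # draft model
        "num_speculative_tokens": 5,  # tokens the draft proposes per step
    },
)

params = SamplingParams(temperature=0.0, max_tokens=256)
outputs = llm.generate(
    ["Explain why a small draft model can speed up decoding of a large model."],
    params,
)
print(outputs[0].outputs[0].text)
```

The speed-up comes from the target model verifying several draft-proposed tokens in a single forward pass instead of generating them one at a time; a larger `num_speculative_tokens` helps only while the draft's acceptance rate stays high.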