turboderp/Qwama-0.5B-Instruct
Text generation · Model size: 0.5B · Quant: BF16 · Context length: 32k · Published: Jun 13, 2024 · License: apache-2.0 · Architecture: Transformer

turboderp/Qwama-0.5B-Instruct is a 0.5 billion parameter instruction-tuned causal language model based on Qwen2-0.5B-Instruct, fitted with a Llama-3 vocabulary. Developed by turboderp, it is intended primarily as a lightweight draft model for the larger Llama-3-70B-Instruct in speculative decoding. It also serves as an exploration of the feasibility and cost of vocabulary swaps between dissimilar language models.


turboderp/Qwama-0.5B-Instruct Overview

Qwama-0.5B-Instruct is a 0.5 billion parameter instruction-tuned model, a modified version of Qwen2-0.5B-Instruct with a Llama-3 vocabulary. Its primary purpose is to act as a lightweight draft model for speculative decoding with larger models like Llama-3-70B-Instruct, offering a less resource-intensive alternative to Llama3-8B-Instruct for this role.

Key Features and Development

  • Vocabulary Swap: The model's defining characteristic is its Llama-3 vocabulary, achieved by creating a new embedding layer initialized from the corresponding Qwen2 token embeddings: each Llama-3 token is mapped to the Qwen2 token(s) encoding the same string, and where a match spans multiple Qwen2 tokens, their embeddings are averaged.
  • Finetuning: After the vocabulary swap, the model was finetuned to restore coherence: first on a 2.41-million-row sample from Common Crawl, then for three epochs on roughly 25,000 instruct-formatted completions generated by Llama3-8B-Instruct.
  • Performance: The vocabulary swap caused some degradation relative to the base Qwen2-0.5B-Instruct (Wikitext 2k perplexity rose from 12.57 to 15.34; MMLU dropped from 43.83% to 40.37%), but the model remains an effective draft. Used as a draft model for Llama3-70B-Instruct, it achieves a 3.72x speedup on code and 1.92x on prose, higher than what Qwen2-0.5B-Instruct achieves as a draft for Qwen2-72B-Instruct.
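The embedding-transplant step above can be sketched in a few lines. This is a toy illustration of the general technique (map each new-vocabulary token to the old-vocabulary tokens that encode the same string, then copy or average the old embeddings), not the model's actual procedure; the arrays, mapping, and sizes are invented:

```python
import numpy as np

rng = np.random.default_rng(0)
old_embeddings = rng.standard_normal((8, 4))  # toy stand-in for the Qwen2 embedding table

# Toy mapping: new (Llama-3-style) token id -> old (Qwen2-style) token ids
# that encode the same string. Real mappings come from re-tokenizing each
# new token's string with the old tokenizer.
new_to_old = {
    0: [3],        # exact single-token match: copy the embedding
    1: [1, 5],     # multi-token match: average the embeddings
    2: [0, 2, 7],
}

# Build the new embedding table: mean over the mapped old embeddings
# (the mean of a single row is just that row, so both cases are covered).
new_embeddings = np.zeros((len(new_to_old), old_embeddings.shape[1]))
for new_id, old_ids in new_to_old.items():
    new_embeddings[new_id] = old_embeddings[old_ids].mean(axis=0)

assert np.allclose(new_embeddings[0], old_embeddings[3])
assert np.allclose(new_embeddings[1], old_embeddings[[1, 5]].mean(axis=0))
```

After this initialization the new table is only a rough approximation, which is why the card describes a finetuning pass to restore coherence.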

Use Cases

  • Speculative Decoding: Ideal for accelerating inference on larger Llama-3 models by serving as a fast, lightweight draft model.
  • Research: Provides a practical example for exploring the viability and challenges of vocabulary transplantation between different language models.
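To make the draft/verify mechanics of speculative decoding concrete, here is a toy loop in plain Python. Deterministic functions stand in for the draft (the role Qwama-0.5B-Instruct plays) and the target (Llama-3-70B-Instruct); the token rules and function names are invented for illustration, and real implementations verify all drafted tokens in a single batched target forward pass:

```python
def draft_model(ctx):
    # cheap stand-in model: next token is (last + 1) mod 10
    return (ctx[-1] + 1) % 10

def target_model(ctx):
    # expensive stand-in model: same rule, except it emits 0 after a 7
    return 0 if ctx[-1] == 7 else (ctx[-1] + 1) % 10

def speculative_step(ctx, k=4):
    # 1) draft proposes k tokens autoregressively
    proposal, tmp = [], list(ctx)
    for _ in range(k):
        t = draft_model(tmp)
        proposal.append(t)
        tmp.append(t)
    # 2) target verifies: accept the longest agreeing prefix, then emit
    #    the target's own token at the first mismatch
    accepted, tmp = [], list(ctx)
    for t in proposal:
        want = target_model(tmp)
        if want == t:
            accepted.append(t)
            tmp.append(t)
        else:
            accepted.append(want)  # target's correction ends the step
            break
    else:
        accepted.append(target_model(tmp))  # bonus token: all drafts accepted
    return accepted

seq = [5]
while len(seq) < 12:
    seq.extend(speculative_step(seq))
print(seq[:12])  # [5, 6, 7, 0, 1, 2, 3, 4, 5, 6, 7, 0]
```

Each step yields between 1 and k+1 target-verified tokens per target evaluation; the speedup depends on how often the draft agrees with the target, which is why a draft sharing the target's vocabulary (the point of the Llama-3 vocabulary swap) matters.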