turboderp/Qwama-0.5B-Instruct
Text generation · Model size: 0.5B · Quant: BF16 · Context length: 32k · Published: Jun 13, 2024 · License: apache-2.0 · Architecture: Transformer

turboderp/Qwama-0.5B-Instruct is a 0.5 billion parameter instruction-tuned causal language model based on Qwen2-0.5B-Instruct, fitted with a Llama-3 vocabulary. Developed by turboderp, it is intended primarily as a lightweight draft model for the larger Llama-3-70B-Instruct in speculative decoding. It also serves as an exploration of the feasibility and cost of vocabulary swaps between dissimilar language models.


turboderp/Qwama-0.5B-Instruct Overview

Qwama-0.5B-Instruct is a 0.5 billion parameter instruction-tuned model, a modified version of Qwen2-0.5B-Instruct with a Llama-3 vocabulary. Its primary purpose is to act as a lightweight draft model for speculative decoding with larger models like Llama-3-70B-Instruct, offering a less resource-intensive alternative to Llama3-8B-Instruct for this role.

Key Features and Development

  • Vocabulary Swap: The model's defining characteristic is its Llama-3 vocabulary, achieved by creating a new embedding layer initialized from the corresponding Qwen2 token embeddings: each Llama-3 token is mapped to the Qwen2 token(s) encoding the same string, and where a match spans multiple Qwen2 tokens, their embeddings are averaged.
  • Finetuning: After the vocabulary swap, the model was finetuned to restore coherence: first on a 2.41-million-row sample from Common Crawl, then for three epochs on roughly 25,000 instruct-formatted completions generated by Llama3-8B-Instruct.
  • Performance: The vocabulary swap caused some degradation relative to the base Qwen2-0.5B-Instruct (Wikitext 2k perplexity rose from 12.57 to 15.34; MMLU dropped from 43.83% to 40.37%), but the model remains an effective draft. Used as a draft model for Llama3-70B-Instruct, it achieves a 3.72x speedup on code and 1.92x on prose, higher than what Qwen2-0.5B-Instruct achieves as a draft for Qwen2-72B-Instruct.
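The embedding-transplant step above can be sketched in a few lines. This is a toy illustration of the general technique (map each new-vocabulary token to the old-vocabulary tokens that encode the same string, then copy or average the old embeddings), not the model's actual procedure; the arrays, mapping, and sizes are invented:

```python
import numpy as np

rng = np.random.default_rng(0)
old_embeddings = rng.standard_normal((8, 4))  # toy stand-in for the Qwen2 embedding table

# Toy mapping: new (Llama-3-style) token id -> old (Qwen2-style) token ids
# that encode the same string. Real mappings come from re-tokenizing each
# new token's string with the old tokenizer.
new_to_old = {
    0: [3],        # exact single-token match: copy the embedding
    1: [1, 5],     # multi-token match: average the embeddings
    2: [0, 2, 7],
}

# Build the new embedding table: mean over the mapped old embeddings
# (the mean of a single row is just that row, so both cases are covered).
new_embeddings = np.zeros((len(new_to_old), old_embeddings.shape[1]))
for new_id, old_ids in new_to_old.items():
    new_embeddings[new_id] = old_embeddings[old_ids].mean(axis=0)

assert np.allclose(new_embeddings[0], old_embeddings[3])
assert np.allclose(new_embeddings[1], old_embeddings[[1, 5]].mean(axis=0))
```

After this initialization the new table is only a rough approximation, which is why the card describes a finetuning pass to restore coherence.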

Use Cases

  • Speculative Decoding: Ideal for accelerating inference on larger Llama-3 models by serving as a fast, lightweight draft model.
  • Research: Provides a practical example for exploring the viability and challenges of vocabulary transplantation between different language models.
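To make the draft/verify mechanics of speculative decoding concrete, here is a toy loop in plain Python. Deterministic functions stand in for the draft (the role Qwama-0.5B-Instruct plays) and the target (Llama-3-70B-Instruct); the token rules and function names are invented for illustration, and real implementations verify all drafted tokens in a single batched target forward pass:

```python
def draft_model(ctx):
    # cheap stand-in model: next token is (last + 1) mod 10
    return (ctx[-1] + 1) % 10

def target_model(ctx):
    # expensive stand-in model: same rule, except it emits 0 after a 7
    return 0 if ctx[-1] == 7 else (ctx[-1] + 1) % 10

def speculative_step(ctx, k=4):
    # 1) draft proposes k tokens autoregressively
    proposal, tmp = [], list(ctx)
    for _ in range(k):
        t = draft_model(tmp)
        proposal.append(t)
        tmp.append(t)
    # 2) target verifies: accept the longest agreeing prefix, then emit
    #    the target's own token at the first mismatch
    accepted, tmp = [], list(ctx)
    for t in proposal:
        want = target_model(tmp)
        if want == t:
            accepted.append(t)
            tmp.append(t)
        else:
            accepted.append(want)  # target's correction ends the step
            break
    else:
        accepted.append(target_model(tmp))  # bonus token: all drafts accepted
    return accepted

seq = [5]
while len(seq) < 12:
    seq.extend(speculative_step(seq))
print(seq[:12])  # [5, 6, 7, 0, 1, 2, 3, 4, 5, 6, 7, 0]
```

Each step yields between 1 and k+1 target-verified tokens per target evaluation; the speedup depends on how often the draft agrees with the target, which is why a draft sharing the target's vocabulary (the point of the Llama-3 vocabulary swap) matters.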