embedl/Llama-3.2-3B-Instruct-FlashHead
embedl/Llama-3.2-3B-Instruct-FlashHead is an optimized version of the Llama-3.2-3B-Instruct model, developed by Embedl. It utilizes FlashHead, an efficient replacement for the language model head, to reduce model size while preserving accuracy. This model is specifically designed for low-latency inference on NVIDIA RTX GPUs, offering significant speedups over the baseline while maintaining accuracy parity on common benchmarks.
Loading preview...
Llama-3.2-3B-Instruct-FlashHead Overview
This model is an optimized version of Llama-3.2-3B-Instruct, developed by Embedl. It integrates FlashHead, an efficient language model head replacement, to achieve a smaller footprint without compromising accuracy. The primary focus of this optimization is to enable low-latency inference on NVIDIA RTX GPUs.
Key Optimizations & Performance
- FlashHead LM Head: A lightweight replacement that significantly improves throughput.
- vLLM Plugin Integration: Compatible with vLLM (0.14.0+) via the
flash-headplugin for enhanced performance. - Speedup: Achieves up to 1.26x end-to-end speedup over state-of-the-art methods, with FlashHead W4A16 delivering 177 tokens/sec compared to 54 tokens/sec for BF16 baseline on an RTX 3500 Ada.
- Accuracy Parity: Maintains accuracy within rounding error of the Llama-3.2-3B-Instruct baseline across benchmarks like MMLU-Pro, HellaSwag, GSM8K, IFEval, BBH, and TruthfulQA.
Intended Use Cases
- Text generation and reasoning tasks.
- Assistant-style interactions and general-purpose NLP.
- Ideal for deployments requiring high inference speed on NVIDIA RTX GPUs.
Limitations
- Requires vLLM >= 0.14.0.
- Currently optimized specifically for NVIDIA RTX GPUs.