embedl/Llama-3.2-3B-Instruct-FlashHead

Hugging Face
TEXT GENERATIONConcurrency Cost:1Model Size:3.2BQuant:BF16Ctx Length:32kPublished:Dec 8, 2025License:embedl-models-community-licence-1.0Architecture:Transformer0.0K Gated Warm

embedl/Llama-3.2-3B-Instruct-FlashHead is an optimized version of the Llama-3.2-3B-Instruct model, developed by Embedl. It utilizes FlashHead, an efficient replacement for the language model head, to reduce model size while preserving accuracy. This model is specifically designed for low-latency inference on NVIDIA RTX GPUs, offering significant speedups over the baseline while maintaining accuracy parity on common benchmarks.

Loading preview...

Llama-3.2-3B-Instruct-FlashHead Overview

This model is an optimized version of Llama-3.2-3B-Instruct, developed by Embedl. It integrates FlashHead, an efficient language model head replacement, to achieve a smaller footprint without compromising accuracy. The primary focus of this optimization is to enable low-latency inference on NVIDIA RTX GPUs.

Key Optimizations & Performance

  • FlashHead LM Head: A lightweight replacement that significantly improves throughput.
  • vLLM Plugin Integration: Compatible with vLLM (0.14.0+) via the flash-head plugin for enhanced performance.
  • Speedup: Achieves up to 1.26x end-to-end speedup over state-of-the-art methods, with FlashHead W4A16 delivering 177 tokens/sec compared to 54 tokens/sec for BF16 baseline on an RTX 3500 Ada.
  • Accuracy Parity: Maintains accuracy within rounding error of the Llama-3.2-3B-Instruct baseline across benchmarks like MMLU-Pro, HellaSwag, GSM8K, IFEval, BBH, and TruthfulQA.

Intended Use Cases

  • Text generation and reasoning tasks.
  • Assistant-style interactions and general-purpose NLP.
  • Ideal for deployments requiring high inference speed on NVIDIA RTX GPUs.

Limitations

  • Requires vLLM >= 0.14.0.
  • Currently optimized specifically for NVIDIA RTX GPUs.