embedl/Llama-3.2-3B-Instruct-FlashHead

Text Generation · Concurrency Cost: 1 · Model Size: 3.2B · Quant: BF16 · Context Length: 32k · Published: Dec 8, 2025 · License: embedl-models-community-licence-1.0 · Architecture: Transformer

embedl/Llama-3.2-3B-Instruct-FlashHead is a 3-billion-parameter instruction-tuned causal language model developed by Embedl. It is an optimized version of Llama-3.2-3B-Instruct that uses the FlashHead LM head to reduce model size and improve throughput while maintaining accuracy. The model is designed for low-latency inference on NVIDIA RTX GPUs and targets text generation, reasoning, and assistant-style interactions.


Overview

This model integrates FlashHead, Embedl's efficient replacement for the traditional language-model head, into the Llama-3.2-3B-Instruct base model. FlashHead significantly improves throughput and reduces model size while preserving accuracy, and the model is engineered specifically for low-latency inference on NVIDIA RTX GPUs.

Key Optimizations & Performance

  • FlashHead LM Head: A lightweight replacement for the dense LM head that improves throughput without degrading accuracy.
  • vLLM Plugin Integration: Fully compatible with vLLM (0.14.0+) via the flash-head plugin, which activates automatically.
  • Speedup: Achieves a 1.07x speedup over the BF16 baseline and a 1.26x speedup over state-of-the-art W4A16 quantization, reaching 177 tokens/sec on an RTX 3500 Ada with W4A16 FlashHead.
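As a sketch of the vLLM integration described above, serving the model through vLLM's OpenAI-compatible server might look like the following. This assumes vLLM 0.14.0 or newer with the FlashHead plugin installed per Embedl's instructions; per the model card, the plugin activates automatically, so no extra flags should be needed:

```shell
# Serve the model with vLLM's OpenAI-compatible API server.
# Assumes vLLM >= 0.14.0 and the FlashHead plugin are installed;
# the plugin registers itself automatically at startup.
vllm serve embedl/Llama-3.2-3B-Instruct-FlashHead \
    --dtype bfloat16 \
    --max-model-len 32768
```

Once the server is up, any OpenAI-compatible client can send chat-completion requests to it.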

Accuracy & Use Cases

  • Accuracy Parity: FlashHead closely matches the baseline Llama-3.2-3B-Instruct model's accuracy across benchmarks like MMLU-Pro, IFEval, BBH, TruthfulQA, and GSM8K.
  • Intended Use: Ideal for text generation, reasoning, assistant-style interaction, and general-purpose NLP tasks where low-latency inference on NVIDIA RTX GPUs is critical.
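For the assistant-style use case above, offline inference with vLLM's Python API might look like the following minimal sketch. It assumes an NVIDIA RTX GPU, vLLM 0.14.0+, and the FlashHead plugin installed; the prompt text is illustrative only:

```python
from vllm import LLM, SamplingParams

# Load the model for offline inference. The FlashHead plugin
# (assumed installed) is picked up automatically by vLLM.
llm = LLM(model="embedl/Llama-3.2-3B-Instruct-FlashHead", dtype="bfloat16")

params = SamplingParams(temperature=0.7, max_tokens=256)
messages = [
    {"role": "user", "content": "Explain transformers in two sentences."},
]

# llm.chat applies the model's chat template before generating.
outputs = llm.chat(messages, params)
print(outputs[0].outputs[0].text)
```

The same messages-based interface works against the OpenAI-compatible server when serving instead of running offline.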

Limitations

  • Requires vLLM >= 0.14.0.
  • Currently optimized for NVIDIA RTX GPUs.