Name: embedl/Llama-3.2-3B-Instruct-FlashHead API
Brand: Featherless.ai
Price: 10.00 USD
Availability: InStock
Author: embedl

Llama-3.2-3B-Instruct-FlashHead Overview

This model is an optimized version of Llama-3.2-3B-Instruct, developed by Embedl. It replaces the standard language model head with FlashHead, a more efficient alternative, to reduce the model's footprint while keeping accuracy on par with the baseline. The optimization targets low-latency inference on NVIDIA RTX GPUs.

Key Optimizations & Performance

FlashHead LM Head: A lightweight replacement that significantly improves throughput.
vLLM Plugin Integration: Compatible with vLLM (0.14.0+) via the flash-head plugin for enhanced performance.
Speedup: Achieves up to 1.26x end-to-end speedup over state-of-the-art methods. On an RTX 3500 Ada, FlashHead W4A16 delivers 177 tokens/sec versus 54 tokens/sec for the BF16 baseline (about 3.3x).
Accuracy Parity: Maintains accuracy within rounding error of the Llama-3.2-3B-Instruct baseline across benchmarks like MMLU-Pro, HellaSwag, GSM8K, IFEval, BBH, and TruthfulQA.
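
The throughput figures above can be turned into a speedup ratio and a per-token latency with simple arithmetic. The sketch below only restates the two quoted numbers (177 and 54 tokens/sec); the function names are illustrative and not part of any Embedl or vLLM API.

```python
# Throughput figures quoted in the listing (tokens/sec on an RTX 3500 Ada).
FLASHHEAD_W4A16_TPS = 177.0
BF16_BASELINE_TPS = 54.0


def speedup(optimized_tps: float, baseline_tps: float) -> float:
    """Return how many times faster the optimized configuration is."""
    if baseline_tps <= 0:
        raise ValueError("baseline throughput must be positive")
    return optimized_tps / baseline_tps


def ms_per_token(tps: float) -> float:
    """Convert tokens/sec into average milliseconds per generated token."""
    if tps <= 0:
        raise ValueError("throughput must be positive")
    return 1000.0 / tps


ratio = speedup(FLASHHEAD_W4A16_TPS, BF16_BASELINE_TPS)
print(f"FlashHead W4A16 vs BF16 baseline: {ratio:.2f}x")
print(f"BF16 baseline:   {ms_per_token(BF16_BASELINE_TPS):.1f} ms/token")
print(f"FlashHead W4A16: {ms_per_token(FLASHHEAD_W4A16_TPS):.1f} ms/token")
```

Note that this ratio is measured against the BF16 baseline and is a separate comparison from the 1.26x figure, which is quoted against state-of-the-art methods.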

Intended Use Cases

Text generation and reasoning tasks.
Assistant-style interactions and general-purpose NLP.
Ideal for deployments requiring high inference speed on NVIDIA RTX GPUs.

Limitations

Requires vLLM >= 0.14.0.
Currently optimized specifically for NVIDIA RTX GPUs.
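
Because the model requires vLLM 0.14.0 or newer, a deployment script may want to check the installed version before loading it. The following is a minimal sketch using only the Python standard library; the helper names (parse_release, meets_minimum, installed_vllm_ok) are hypothetical, and the parsing deliberately ignores pre-release and local version suffixes.

```python
import re
from importlib import metadata

# Minimum vLLM release stated in the listing.
MIN_VLLM = (0, 14, 0)


def parse_release(version: str) -> tuple:
    """Extract (major, minor, patch) from a version string.

    Suffixes such as 'rc1', '.post1' or '+cu121' are ignored, so this is
    a simplification rather than a full version-specifier parser.
    """
    parts = []
    for piece in version.split(".")[:3]:
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def meets_minimum(version: str, minimum: tuple = MIN_VLLM) -> bool:
    """Return True if the version is at least the required minimum."""
    return parse_release(version) >= minimum


def installed_vllm_ok() -> bool:
    """Check the locally installed vLLM package, if there is one."""
    try:
        return meets_minimum(metadata.version("vllm"))
    except metadata.PackageNotFoundError:
        return False


if __name__ == "__main__":
    print("vLLM requirement satisfied:", installed_vllm_ok())
```

Comparing integer tuples avoids the string-comparison pitfall where "0.9.2" would sort after "0.14.0".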
