RedHatAI/Meta-Llama-3-8B-Instruct-FP8-KV

Public model, released May 20, 2024, hosted on Hugging Face. 8B parameters, FP8 quantization, 8192-token context length.

Overview

RedHatAI/Meta-Llama-3-8B-Instruct-FP8-KV is an 8 billion parameter instruction-tuned model derived from Meta-Llama-3-8B-Instruct. Its primary distinction lies in its FP8 quantization for both model weights and activations, alongside an FP8 Key-Value (KV) Cache. This optimization is designed for highly efficient inference, particularly when used with vLLM (version 0.5.0 or newer), by reducing memory footprint and increasing throughput.
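The memory saving from the FP8 KV cache can be estimated with back-of-envelope arithmetic. The sketch below assumes the public Llama-3-8B architecture (32 layers, 8 KV heads via grouped-query attention, head dimension 128) and compares a 16-bit cache against an 8-bit one at the full context length:

```python
# Back-of-envelope KV-cache sizing for a Llama-3-8B-shaped model.
# Dims assumed from the public Llama-3-8B config: 32 layers,
# 8 KV heads (GQA), head_dim 128.
N_LAYERS, N_KV_HEADS, HEAD_DIM = 32, 8, 128

def kv_cache_bytes(seq_len: int, bytes_per_elem: int) -> int:
    # Keys and values (factor of 2) are stored for every layer
    # and every KV head, for each token in the sequence.
    return 2 * N_LAYERS * N_KV_HEADS * HEAD_DIM * seq_len * bytes_per_elem

ctx = 8192  # full context window
fp16 = kv_cache_bytes(ctx, 2)  # 2 bytes/element
fp8 = kv_cache_bytes(ctx, 1)   # 1 byte/element
print(f"FP16 KV cache @ {ctx} tokens: {fp16 / 2**30:.2f} GiB")  # 1.00 GiB
print(f"FP8  KV cache @ {ctx} tokens: {fp8 / 2**30:.2f} GiB")   # 0.50 GiB
```

Halving the per-token cache size frees GPU memory for larger batches, which is where the throughput gain comes from.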

Key Capabilities

  • FP8 Quantization: Utilizes per-tensor FP8 quantization for weights and activations, enabling faster and more memory-efficient inference.
  • FP8 KV Cache: Incorporates FP8 quantization for the KV Cache, further enhancing inference efficiency and reducing memory usage.
  • vLLM Integration: Specifically prepared for seamless integration and optimized performance with the vLLM inference engine, requiring the --kv-cache-dtype fp8 argument.
  • Strong Performance: Despite aggressive quantization, the model retains competitive performance, scoring 74.98 on the gsm8k 5-shot benchmark, closely matching the unquantized base model.
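The vLLM requirement above can be sketched as a serving command. This is a minimal sketch assuming vLLM 0.5.0+ is installed and a compatible GPU is available; the port and prompt are illustrative, and older vLLM versions expose the same server via `python -m vllm.entrypoints.openai.api_server --model …` instead:

```shell
# Serve the FP8 checkpoint with an FP8 KV cache (vLLM >= 0.5.0).
vllm serve RedHatAI/Meta-Llama-3-8B-Instruct-FP8-KV \
  --kv-cache-dtype fp8

# Query the OpenAI-compatible endpoint (default port 8000).
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "RedHatAI/Meta-Llama-3-8B-Instruct-FP8-KV",
        "messages": [{"role": "user", "content": "Hello!"}]
      }'
```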

Good For

  • Resource-Constrained Deployments: Ideal for environments where memory and computational resources are limited, but high inference speed is required.
  • High-Throughput Applications: Suitable for applications demanding rapid response times and processing a large volume of requests.
  • Efficient LLM Inference: A fit for developers who want Meta-Llama-3-8B-Instruct capabilities at significantly reduced operational cost, with the efficiency gains coming from FP8 quantization.