RedHatAI/Meta-Llama-3-8B-Instruct-FP8-KV

Public model, released May 20, 2024, hosted on Hugging Face. 8B parameters, FP8 quantization, 8192-token context length.

Overview

RedHatAI/Meta-Llama-3-8B-Instruct-FP8-KV is an 8 billion parameter instruction-tuned model derived from Meta-Llama-3-8B-Instruct. Its primary distinction lies in its FP8 quantization for both model weights and activations, alongside an FP8 Key-Value (KV) Cache. This optimization is designed for highly efficient inference, particularly when used with vLLM (version 0.5.0 or newer), by reducing memory footprint and increasing throughput.
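The memory saving from the FP8 KV cache can be estimated with back-of-envelope arithmetic. The sketch below assumes the public Llama-3-8B architecture (32 layers, 8 KV heads via grouped-query attention, head dimension 128) and compares a 16-bit cache against an 8-bit one at the full context length:

```python
# Back-of-envelope KV-cache sizing for a Llama-3-8B-shaped model.
# Dims assumed from the public Llama-3-8B config: 32 layers,
# 8 KV heads (GQA), head_dim 128.
N_LAYERS, N_KV_HEADS, HEAD_DIM = 32, 8, 128

def kv_cache_bytes(seq_len: int, bytes_per_elem: int) -> int:
    # Keys and values (factor of 2) are stored for every layer
    # and every KV head, for each token in the sequence.
    return 2 * N_LAYERS * N_KV_HEADS * HEAD_DIM * seq_len * bytes_per_elem

ctx = 8192  # full context window
fp16 = kv_cache_bytes(ctx, 2)  # 2 bytes/element
fp8 = kv_cache_bytes(ctx, 1)   # 1 byte/element
print(f"FP16 KV cache @ {ctx} tokens: {fp16 / 2**30:.2f} GiB")  # 1.00 GiB
print(f"FP8  KV cache @ {ctx} tokens: {fp8 / 2**30:.2f} GiB")   # 0.50 GiB
```

Halving the per-token cache size frees GPU memory for larger batches, which is where the throughput gain comes from.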

Key Capabilities

  • FP8 Quantization: Utilizes per-tensor FP8 quantization for weights and activations, enabling faster and more memory-efficient inference.
  • FP8 KV Cache: Incorporates FP8 quantization for the KV Cache, further enhancing inference efficiency and reducing memory usage.
  • vLLM Integration: Specifically prepared for seamless integration and optimized performance with the vLLM inference engine, requiring the --kv-cache-dtype fp8 argument.
  • Strong Performance: Despite aggressive quantization, the model retains competitive performance, scoring 74.98 on the gsm8k 5-shot benchmark, closely matching the unquantized base model.
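The vLLM requirement above can be sketched as a serving command. This is a minimal sketch assuming vLLM 0.5.0+ is installed and a compatible GPU is available; the port and prompt are illustrative, and older vLLM versions expose the same server via `python -m vllm.entrypoints.openai.api_server --model …` instead:

```shell
# Serve the FP8 checkpoint with an FP8 KV cache (vLLM >= 0.5.0).
vllm serve RedHatAI/Meta-Llama-3-8B-Instruct-FP8-KV \
  --kv-cache-dtype fp8

# Query the OpenAI-compatible endpoint (default port 8000).
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "RedHatAI/Meta-Llama-3-8B-Instruct-FP8-KV",
        "messages": [{"role": "user", "content": "Hello!"}]
      }'
```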

Good For

  • Resource-Constrained Deployments: Ideal for environments where memory and computational resources are limited, but high inference speed is required.
  • High-Throughput Applications: Suitable for applications demanding rapid response times and processing a large volume of requests.
  • Efficient LLM Inference: A fit for developers who want Meta-Llama-3-8B-Instruct capabilities at significantly reduced operational cost, with the efficiency gains coming from FP8 quantization.