Overview
RedHatAI/Meta-Llama-3-8B-Instruct-FP8-KV is an 8 billion parameter instruction-tuned model derived from Meta-Llama-3-8B-Instruct. Its primary distinction lies in its FP8 quantization for both model weights and activations, alongside an FP8 Key-Value (KV) Cache. This optimization is designed for highly efficient inference, particularly when used with vLLM (version 0.5.0 or newer), by reducing memory footprint and increasing throughput.
Key Capabilities
- FP8 Quantization: Utilizes per-tensor FP8 quantization for weights and activations, enabling faster and more memory-efficient inference.
- FP8 KV Cache: Incorporates FP8 quantization for the KV Cache, further enhancing inference efficiency and reducing memory usage.
- vLLM Integration: Specifically prepared for seamless integration and optimized performance with the vLLM inference engine, requiring the --kv-cache-dtype fp8 argument.
- Strong Performance: Despite aggressive quantization, the model retains competitive performance, scoring 74.98 on the gsm8k 5-shot benchmark, closely matching the unquantized base model.
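To make the per-tensor FP8 scheme above concrete, here is a minimal pure-Python sketch of per-tensor quantization to the OCP FP8 E4M3 format (3 mantissa bits, maximum magnitude 448): a single scale is computed from the tensor's absolute maximum, then every element is rounded to the nearest representable FP8 value. The function names are illustrative, not part of the model's actual quantization code, and real implementations operate on hardware FP8 types rather than simulating the rounding.

```python
import math

E4M3_MAX = 448.0  # largest representable magnitude in OCP FP8 E4M3


def quantize_e4m3(v: float) -> float:
    """Round v to the nearest FP8 E4M3 value (simulated in float)."""
    if v == 0.0:
        return 0.0
    sign = -1.0 if v < 0 else 1.0
    mag = min(abs(v), E4M3_MAX)        # saturate instead of overflowing
    exp = max(math.floor(math.log2(mag)), -6)  # below 2**-6 values are subnormal
    step = 2.0 ** (exp - 3)            # spacing given 3 mantissa bits
    return sign * min(round(mag / step) * step, E4M3_MAX)


def fp8_per_tensor(xs):
    """Per-tensor FP8: one shared scale, then elementwise E4M3 rounding."""
    scale = max(abs(x) for x in xs) / E4M3_MAX
    return [quantize_e4m3(x / scale) * scale for x in xs], scale
```

Because a single scale covers the whole tensor, per-tensor quantization is cheap to apply at inference time, at the cost of coarser resolution for small-magnitude elements than finer-grained (e.g. per-channel) schemes.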
Good For
- Resource-Constrained Deployments: Ideal for environments where memory and computational resources are limited, but high inference speed is required.
- High-Throughput Applications: Suitable for applications demanding rapid response times and processing a large volume of requests.
- Efficient LLM Inference: A good fit for developers who want the capabilities of Meta-Llama-3-8B-Instruct with significantly reduced operational costs and improved efficiency via FP8 quantization.
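The memory savings from the FP8 KV cache can be estimated directly from Llama-3-8B's published architecture (32 layers, 8 key-value heads via grouped-query attention, head dimension 128). The sketch below, with illustrative function names, computes the KV cache size per sequence for FP16 versus FP8 element widths:

```python
def kv_cache_bytes(tokens, layers=32, kv_heads=8, head_dim=128, bytes_per_elem=2):
    """Bytes of KV cache for one sequence: K and V tensors across all layers."""
    return 2 * layers * kv_heads * head_dim * bytes_per_elem * tokens


ctx = 8192  # tokens in one sequence
fp16_cache = kv_cache_bytes(ctx, bytes_per_elem=2)  # 1.0 GiB
fp8_cache = kv_cache_bytes(ctx, bytes_per_elem=1)   # 0.5 GiB
print(fp16_cache / 2**30, fp8_cache / 2**30)
```

Halving the per-token cache footprint lets a serving engine hold roughly twice as many concurrent sequences in the same memory budget; per the model card, this is realized in vLLM 0.5.0 or newer with the --kv-cache-dtype fp8 argument.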