RedHatAI/Meta-Llama-3-8B-Instruct-FP8-KV
RedHatAI/Meta-Llama-3-8B-Instruct-FP8-KV is an 8 billion parameter instruction-tuned causal language model, developed by RedHatAI, based on Meta-Llama-3. This model is specifically quantized to FP8 for both weights and activations, and includes FP8 KV Cache, optimizing it for efficient inference with vLLM. It maintains strong performance, achieving 74.98 on gsm8k 5-shot, making it suitable for resource-constrained environments requiring high throughput.
Overview
RedHatAI/Meta-Llama-3-8B-Instruct-FP8-KV is an 8 billion parameter instruction-tuned model derived from Meta-Llama-3-8B-Instruct. Its primary distinction lies in its FP8 quantization for both model weights and activations, alongside an FP8 Key-Value (KV) Cache. This optimization is designed for highly efficient inference, particularly when used with vLLM (version 0.5.0 or newer), by reducing memory footprint and increasing throughput.
Key Capabilities
- FP8 Quantization: Utilizes per-tensor FP8 quantization for weights and activations, enabling faster and more memory-efficient inference.
- FP8 KV Cache: Incorporates FP8 quantization for the KV Cache, further enhancing inference efficiency and reducing memory usage.
- vLLM Integration: Prepared for seamless integration and optimized performance with the vLLM inference engine (version 0.5.0 or newer), which must be run with the `--kv-cache-dtype fp8` argument; see the sketch after this list.
- Strong Performance: Despite aggressive quantization, the model retains competitive accuracy, scoring 74.98 on the gsm8k 5-shot benchmark, closely matching the unquantized base model.
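Below is a minimal offline-inference sketch with vLLM. It assumes vLLM 0.5.0 or newer is installed and an FP8-capable GPU is available; the prompt and sampling settings are illustrative only, not part of the model card.

```python
# Minimal sketch: offline inference with vLLM and the FP8 KV cache enabled.
# Assumes vLLM >= 0.5.0; prompt and sampling values below are illustrative.
from vllm import LLM, SamplingParams

llm = LLM(
    model="RedHatAI/Meta-Llama-3-8B-Instruct-FP8-KV",
    kv_cache_dtype="fp8",  # corresponds to the --kv-cache-dtype fp8 server flag
)

sampling_params = SamplingParams(temperature=0.6, top_p=0.9, max_tokens=256)

prompts = ["Summarize the benefits of FP8 quantization for LLM inference."]
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(output.outputs[0].text)
```

For serving, the same setting applies to vLLM's OpenAI-compatible server, e.g. passing `--kv-cache-dtype fp8` when launching it (exact launch command depends on the vLLM version; this is stated as an assumption, not taken from the model card).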
Good For
- Resource-Constrained Deployments: Ideal for environments where memory and computational resources are limited, but high inference speed is required.
- High-Throughput Applications: Suitable for applications demanding rapid response times and processing a large volume of requests.
- Efficient LLM Inference: A fit for developers who want Meta-Llama-3-8B-Instruct capabilities at significantly reduced operational cost and improved efficiency via FP8 quantization.