The NVIDIA Llama 3.1 70B Instruct FP8 model is a quantized version of Meta's Llama 3.1 70B Instruct, an auto-regressive language model with 70 billion parameters and a 128K context length. FP8 quantization reduces disk size and GPU memory requirements by roughly 50%, and the model achieves about a 1.5x inference speedup on H100 GPUs over its BF16 counterpart, making it well suited to high-throughput text generation.
NVIDIA Llama 3.1 70B Instruct FP8 Overview
This model is NVIDIA's FP8-quantized version of Meta's Llama 3.1 70B Instruct, an auto-regressive language model built on an optimized transformer architecture. It has 70 billion parameters and supports a context length of up to 128K tokens.
Key Characteristics & Optimizations
- Quantization: Weights and activations are quantized to FP8 data type using TensorRT Model Optimizer, significantly reducing model size and GPU memory footprint.
- Performance Boost: Achieves approximately 1.5x inference speedup on NVIDIA H100 GPUs compared to the BF16 precision version, with minimal impact on accuracy.
- Hardware Compatibility: Optimized for NVIDIA Blackwell, Hopper, and Lovelace architectures.
- Software Integration: Supports deployment with TensorRT-LLM and vLLM; with vLLM, pass the `quantization="modelopt"` flag.
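As a sketch of the vLLM deployment path, assuming the checkpoint is published under the Hugging Face id `nvidia/Llama-3.1-70B-Instruct-FP8` (verify the id and the required GPU count against the model card), an OpenAI-compatible server can be launched with the ModelOpt quantization backend:

```shell
# Serve the FP8 checkpoint with vLLM's ModelOpt quantization backend.
# The model id and tensor-parallel degree are assumptions; this needs
# Hopper-or-newer GPUs with enough combined memory for ~70 GB of weights.
vllm serve nvidia/Llama-3.1-70B-Instruct-FP8 \
    --quantization modelopt \
    --tensor-parallel-size 2

# Once the server is up, query the OpenAI-compatible endpoint:
curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "nvidia/Llama-3.1-70B-Instruct-FP8", "prompt": "Hello", "max_tokens": 32}'
```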
Performance Benchmarks (FP8 vs. BF16)
| Precision | MMLU | GSM8K (CoT) | ARC Challenge | IFEval | TPS (tokens/s) |
|---|---|---|---|---|---|
| BF16 | 83.3 | 95.3 | 93.7 | 92.1 | 1356.92 |
| FP8 | 83.2 | 94.3 | 93.2 | 92.2 | 2040.30 |
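The headline claims follow from back-of-envelope arithmetic: the throughput ratio in the table reproduces the ~1.5x speedup, and halving bytes-per-weight (2 bytes in BF16 vs. 1 byte in FP8) accounts for the ~50% reduction in weight storage. A minimal check, using the figures from the table above:

```python
# Back-of-envelope check of the speedup and memory-savings claims.
bf16_tps = 1356.92  # tokens/s at BF16 (from the benchmark table)
fp8_tps = 2040.30   # tokens/s at FP8

speedup = fp8_tps / bf16_tps
print(f"Throughput speedup: {speedup:.2f}x")  # ~1.50x

# Weight storage: 70B parameters at 2 bytes (BF16) vs. 1 byte (FP8).
params = 70e9
bf16_gb = params * 2 / 1e9
fp8_gb = params * 1 / 1e9
print(f"Weights: {bf16_gb:.0f} GB (BF16) -> {fp8_gb:.0f} GB (FP8), "
      f"{1 - fp8_gb / bf16_gb:.0%} smaller")
```

Note that the ~50% figure covers weights only; activation and KV-cache memory depend on the serving configuration.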
Ideal Use Cases
- High-throughput inference: When deploying large language models where speed and memory efficiency are critical.
- Resource-constrained environments: For applications requiring a powerful 70B model with reduced hardware demands.
- Text generation and instruction following: Leveraging the capabilities of the Llama 3.1 Instruct base model for various NLP tasks.