nvidia/Llama-3.1-70B-Instruct-FP8
TEXT GENERATION · Concurrency Cost: 4 · Model Size: 70B · Quant: FP8 · Ctx Length: 32K · Published: Aug 29, 2024 · License: llama3.1 · Architecture: Transformer

The NVIDIA Llama 3.1 70B Instruct FP8 model is a quantized version of Meta's Llama 3.1 70B Instruct, an auto-regressive language model with 70 billion parameters and a 32K context length. This model is optimized for efficient inference, reducing disk size and GPU memory requirements by approximately 50% through FP8 quantization. It achieves a 1.5x speedup on H100 GPUs compared to its BF16 counterpart, making it suitable for high-throughput text generation tasks.
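The roughly 50% memory saving follows directly from the data-type widths (2 bytes per parameter in BF16 vs. 1 byte in FP8). A quick back-of-envelope sketch, counting weights only and ignoring KV cache, activations, and runtime overhead:

```python
# Illustrative weight-memory estimate for a 70B-parameter model.
# Ignores KV cache, activations, and framework overhead.
params = 70e9
bf16_gb = params * 2 / 1e9   # BF16: 2 bytes per parameter
fp8_gb = params * 1 / 1e9    # FP8:  1 byte per parameter

print(f"BF16 weights: ~{bf16_gb:.0f} GB")
print(f"FP8 weights:  ~{fp8_gb:.0f} GB")
print(f"Reduction:    {1 - fp8_gb / bf16_gb:.0%}")
```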


NVIDIA Llama 3.1 70B Instruct FP8 Overview

This model is NVIDIA's FP8 quantized version of Meta's Llama 3.1 70B Instruct, an auto-regressive language model built on an optimized transformer architecture. It features 70 billion parameters; while Llama 3.1 natively supports a context length of up to 128K tokens, this deployment is configured for a 32K context window.

Key Characteristics & Optimizations

  • Quantization: Weights and activations are quantized to FP8 data type using TensorRT Model Optimizer, significantly reducing model size and GPU memory footprint.
  • Performance Boost: Achieves approximately 1.5x inference speedup on NVIDIA H100 GPUs compared to the BF16 precision version, with minimal impact on accuracy.
  • Hardware Compatibility: Optimized for NVIDIA Blackwell, Hopper, and Lovelace architectures.
  • Software Integration: Supports deployment with TensorRT-LLM and vLLM; vLLM requires the `quantization="modelopt"` option to load the checkpoint.

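As a sketch of the vLLM path, the checkpoint can be served with the ModelOpt quantization backend. This is a deployment config fragment, not a tested recipe: it assumes a recent vLLM build and enough H100-class GPU memory for a 70B FP8 model; adjust `--tensor-parallel-size` to your hardware.

```shell
# Serve the FP8 checkpoint via vLLM's ModelOpt quantization backend.
# Assumes a recent vLLM release and sufficient GPU memory (70B FP8).
vllm serve nvidia/Llama-3.1-70B-Instruct-FP8 \
    --quantization modelopt \
    --tensor-parallel-size 2
```

The server then exposes an OpenAI-compatible API on port 8000 by default, so existing OpenAI-client code can point at it unchanged.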
Performance Benchmarks (FP8 vs. BF16)

Precision | MMLU | GSM8K (CoT) | ARC Challenge | IFEVAL | Throughput (tokens/s)
--------- | ---- | ----------- | ------------- | ------ | ---------------------
BF16      | 83.3 | 95.3        | 93.7          | 92.1   | 1356.92
FP8       | 83.2 | 94.3        | 93.2          | 92.2   | 2040.30
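The throughput column bears out the quoted ~1.5x speedup; the ratio of the two measured tokens-per-second figures:

```python
# Speedup implied by the benchmark table's throughput column.
bf16_tps = 1356.92  # BF16 tokens/s on H100
fp8_tps = 2040.30   # FP8 tokens/s on H100

speedup = fp8_tps / bf16_tps
print(f"FP8 speedup over BF16: {speedup:.2f}x")  # ~1.50x
```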

Ideal Use Cases

  • High-throughput inference: When deploying large language models where speed and memory efficiency are critical.
  • Resource-constrained environments: For applications requiring a powerful 70B model with reduced hardware demands.
  • Text generation and instruction following: Leveraging the capabilities of the Llama 3.1 Instruct base model for various NLP tasks.