nvidia/Llama-3.1-8B-Instruct-FP8

License: llama3.1

Model Overview

nvidia/Llama-3.1-8B-Instruct-FP8 is an 8-billion-parameter, instruction-tuned language model: a quantized version of Meta's Llama 3.1 8B Instruct. Developed by NVIDIA, it applies FP8 quantization via the TensorRT Model Optimizer to improve inference efficiency.
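For context, below is a minimal sketch of how an FP8 checkpoint like this one can be produced with the TensorRT Model Optimizer (modelopt) Python API. The base model loading, calibration prompts, and configuration shown here are illustrative assumptions; NVIDIA's exact quantization recipe is not described in this card.

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer
    import modelopt.torch.quantization as mtq

    base = "meta-llama/Llama-3.1-8B-Instruct"
    tokenizer = AutoTokenizer.from_pretrained(base)
    model = AutoModelForCausalLM.from_pretrained(
        base, torch_dtype=torch.bfloat16, device_map="cuda"
    )

    # Placeholder calibration prompts; a real run would use a few hundred
    # representative samples.
    calib_texts = ["FP8 quantization reduces memory use.", "Hello, world."]

    def forward_loop(m):
        # Run calibration data through the model so modelopt can record
        # activation ranges for the FP8 scaling factors.
        for text in calib_texts:
            inputs = tokenizer(text, return_tensors="pt").to(m.device)
            with torch.no_grad():
                m(**inputs)

    # FP8_DEFAULT_CFG targets the weights and activations of linear layers,
    # matching the quantization scheme this card describes.
    model = mtq.quantize(model, mtq.FP8_DEFAULT_CFG, forward_loop)

A real pipeline would then export the quantized model to a deployable checkpoint, for example via modelopt's export utilities.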

Key Features & Optimizations

  • FP8 Quantization: Weights and activations of the linear operators inside the transformer blocks are quantized to FP8, cutting disk size and GPU memory requirements roughly in half (about 16 GB of weights in BF16 versus about 8 GB in FP8 for 8B parameters).
  • Performance: Achieves a ~1.3x inference speedup on H100 GPUs over the BF16 baseline while staying close to it in accuracy, e.g. MMLU 68.7 (FP8) vs 69.4 (BF16) and GSM8K 83.1 (FP8) vs 84.5 (BF16).
  • Architecture: Based on the Llama 3.1 transformer architecture, supporting text input and output with a context length of up to 128K tokens.
  • Hardware Compatibility: Optimized for the NVIDIA Blackwell, Hopper, and Ada Lovelace microarchitectures.
  • Deployment: Ready to deploy with TensorRT-LLM and vLLM, offering flexible integration options; a vLLM sketch follows this list.
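
As a concrete deployment example, here is a minimal offline-inference sketch with vLLM, assuming a recent vLLM release with modelopt FP8 support; the prompt and sampling settings are arbitrary placeholders.

    from vllm import LLM, SamplingParams

    # "modelopt" selects the loader for TensorRT Model Optimizer FP8
    # checkpoints; recent vLLM versions can also infer this from the
    # checkpoint's quantization config.
    llm = LLM(model="nvidia/Llama-3.1-8B-Instruct-FP8", quantization="modelopt")

    params = SamplingParams(temperature=0.6, top_p=0.9, max_tokens=128)

    # For an instruct model, production code should format requests with the
    # Llama 3.1 chat template; a raw prompt is used here for brevity.
    outputs = llm.generate(["Explain FP8 quantization in one paragraph."], params)
    print(outputs[0].outputs[0].text)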

Intended Use Cases

This model is well suited to developers who need a high-performance, resource-efficient instruction-tuned language model for general text generation tasks. Its FP8 quantization makes it particularly attractive where memory footprint and inference speed are critical, such as edge deployments or large-scale inference on NVIDIA hardware.
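
For large-scale serving, one option is to run the model behind vLLM's OpenAI-compatible server (started with, e.g., vllm serve nvidia/Llama-3.1-8B-Instruct-FP8) and query it with the standard openai client. The sketch below assumes such a server; the endpoint URL and API key are placeholders.

    from openai import OpenAI

    # Points at a locally running vLLM server; vLLM ignores the API key by
    # default, so any placeholder value works.
    client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

    resp = client.chat.completions.create(
        model="nvidia/Llama-3.1-8B-Instruct-FP8",
        messages=[{"role": "user", "content": "Summarize the benefits of FP8 inference."}],
        max_tokens=128,
    )
    print(resp.choices[0].message.content)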