The NVIDIA Llama 3.1 70B Instruct FP8 model is a quantized version of Meta's Llama 3.1 70B Instruct, an auto-regressive language model with 70 billion parameters and a 128K context length. FP8 quantization reduces disk size and GPU memory requirements by roughly 50%, and the model achieves about a 1.5x inference speedup on H100 GPUs over its BF16 counterpart, making it well suited to high-throughput text generation.
NVIDIA Llama 3.1 70B Instruct FP8 Overview
This model is NVIDIA's FP8-quantized version of Meta's Llama 3.1 70B Instruct, an auto-regressive language model built on an optimized transformer architecture. It has 70 billion parameters and supports a context length of up to 128K tokens.
Key Characteristics & Optimizations
- Quantization: Weights and activations are quantized to FP8 data type using TensorRT Model Optimizer, significantly reducing model size and GPU memory footprint.
- Performance Boost: Achieves approximately 1.5x inference speedup on NVIDIA H100 GPUs compared to the BF16 precision version, with minimal impact on accuracy.
- Hardware Compatibility: Optimized for NVIDIA Blackwell, Hopper, and Lovelace architectures.
- Software Integration: Supports deployment with TensorRT-LLM and vLLM; with vLLM, pass the `quantization="modelopt"` flag.
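As a sketch of the vLLM deployment path, assuming the checkpoint is published under the Hugging Face id `nvidia/Llama-3.1-70B-Instruct-FP8` (verify the id and the required GPU count against the model card), an OpenAI-compatible server can be launched with the ModelOpt quantization backend:

```shell
# Serve the FP8 checkpoint with vLLM's ModelOpt quantization backend.
# The model id and tensor-parallel degree are assumptions; this needs
# Hopper-or-newer GPUs with enough combined memory for ~70 GB of weights.
vllm serve nvidia/Llama-3.1-70B-Instruct-FP8 \
    --quantization modelopt \
    --tensor-parallel-size 2

# Once the server is up, query the OpenAI-compatible endpoint:
curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "nvidia/Llama-3.1-70B-Instruct-FP8", "prompt": "Hello", "max_tokens": 32}'
```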
Performance Benchmarks (FP8 vs. BF16)
| Precision | MMLU | GSM8K (CoT) | ARC Challenge | IFEval | TPS (tokens/s) |
|---|---|---|---|---|---|
| BF16 | 83.3 | 95.3 | 93.7 | 92.1 | 1356.92 |
| FP8 | 83.2 | 94.3 | 93.2 | 92.2 | 2040.30 |
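The headline claims follow from back-of-envelope arithmetic: the throughput ratio in the table reproduces the ~1.5x speedup, and halving bytes-per-weight (2 bytes in BF16 vs. 1 byte in FP8) accounts for the ~50% reduction in weight storage. A minimal check, using the figures from the table above:

```python
# Back-of-envelope check of the speedup and memory-savings claims.
bf16_tps = 1356.92  # tokens/s at BF16 (from the benchmark table)
fp8_tps = 2040.30   # tokens/s at FP8

speedup = fp8_tps / bf16_tps
print(f"Throughput speedup: {speedup:.2f}x")  # ~1.50x

# Weight storage: 70B parameters at 2 bytes (BF16) vs. 1 byte (FP8).
params = 70e9
bf16_gb = params * 2 / 1e9
fp8_gb = params * 1 / 1e9
print(f"Weights: {bf16_gb:.0f} GB (BF16) -> {fp8_gb:.0f} GB (FP8), "
      f"{1 - fp8_gb / bf16_gb:.0%} smaller")
```

Note that the ~50% figure covers weights only; activation and KV-cache memory depend on the serving configuration.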
Ideal Use Cases
- High-throughput inference: When deploying large language models where speed and memory efficiency are critical.
- Resource-constrained environments: For applications requiring a powerful 70B model with reduced hardware demands.
- Text generation and instruction following: Leveraging the capabilities of the Llama 3.1 Instruct base model for various NLP tasks.