nvidia/Llama-3.1-8B-Instruct-FP8
The nvidia/Llama-3.1-8B-Instruct-FP8 model is an 8 billion parameter instruction-tuned language model, quantized to FP8 precision by NVIDIA using TensorRT Model Optimizer. This model is derived from Meta's Llama 3.1 8B Instruct and is optimized for efficient inference on NVIDIA hardware, offering approximately 1.3x speedup on H100 GPUs. It maintains strong performance across benchmarks like MMLU and GSM8K while significantly reducing memory footprint. This model is suitable for commercial and non-commercial use in applications requiring fast, resource-efficient text generation.
Model Overview
The nvidia/Llama-3.1-8B-Instruct-FP8 is an 8 billion parameter instruction-tuned language model, a quantized version of Meta's Llama 3.1 8B Instruct. Developed by NVIDIA, this model utilizes FP8 quantization via the TensorRT Model Optimizer to enhance inference efficiency.
Key Features & Optimizations
- FP8 Quantization: Weights and activations of linear operators within transformer blocks are quantized to FP8, reducing disk size and GPU memory requirements by approximately 50%.
- Performance: Achieves a ~1.3x speedup on H100 GPUs compared to its BF16 counterpart, while maintaining competitive performance on benchmarks such as MMLU (68.7 FP8 vs 69.4 BF16) and GSM8K (83.1 FP8 vs 84.5 BF16).
- Architecture: Based on the Llama 3.1 transformer architecture, supporting text input and output with a context length of up to 128K tokens.
- Hardware Compatibility: Optimized for NVIDIA Blackwell, Hopper, and Lovelace microarchitectures.
- Deployment: Ready for deployment with TensorRT-LLM and vLLM, offering flexible integration options.
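To make the FP8 feature above concrete, the sketch below simulates per-tensor E4M3 quantization in NumPy. It is illustrative only: the helper names are invented here, and the rounding model (3 mantissa bits, no subnormals, a single per-tensor scale) is a simplification, not TensorRT Model Optimizer's actual kernels.

```python
import numpy as np

E4M3_MAX = 448.0  # largest finite value in FP8 E4M3

def simulate_e4m3(x: np.ndarray) -> np.ndarray:
    """Round values to an E4M3-like grid: keep 3 explicit mantissa bits,
    then clip to the representable range. Subnormals are ignored."""
    m, e = np.frexp(x)              # x = m * 2**e, with |m| in [0.5, 1)
    m = np.round(m * 16) / 16       # 1 implicit + 3 mantissa bits
    return np.clip(np.ldexp(m, e), -E4M3_MAX, E4M3_MAX)

def quantize_fp8(w: np.ndarray):
    """Per-tensor symmetric quantization: scale into the E4M3 range."""
    scale = np.abs(w).max() / E4M3_MAX
    return simulate_e4m3(w / scale), scale

def dequantize_fp8(q: np.ndarray, scale: float) -> np.ndarray:
    return q * scale

rng = np.random.default_rng(0)
w = rng.standard_normal((4, 4)).astype(np.float32)
q, s = quantize_fp8(w)
w_hat = dequantize_fp8(q, s)
# With 3 mantissa bits, the relative error per element stays under ~1/16.
```

Storing `q` in an actual 8-bit format (rather than the float32 used here for clarity) is what yields the ~50% memory reduction cited above.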
Intended Use Cases
This model is ideal for developers seeking a high-performance, resource-efficient instruction-tuned language model for various text generation tasks. Its FP8 quantization makes it particularly suitable for applications where memory footprint and inference speed are critical, such as edge deployments or large-scale inference on NVIDIA hardware.
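For such deployments, one possible starting point is vLLM's OpenAI-compatible server; the fragment below is a minimal sketch, assuming a vLLM build with ModelOpt FP8 support and a supported (Lovelace/Hopper/Blackwell) GPU, with the port and prompt chosen arbitrarily.

```shell
# Serve the FP8 checkpoint with vLLM's OpenAI-compatible server
vllm serve nvidia/Llama-3.1-8B-Instruct-FP8 --quantization modelopt

# Query it from another terminal (default port 8000)
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "nvidia/Llama-3.1-8B-Instruct-FP8",
        "messages": [{"role": "user", "content": "Hello!"}]
      }'
```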