RedHatAI/Meta-Llama-3-8B-Instruct-FP8
The RedHatAI/Meta-Llama-3-8B-Instruct-FP8 model, developed by Neural Magic, is an 8-billion-parameter instruction-tuned model based on the Llama-3 architecture, optimized with FP8 quantization of weights and activations. This optimization reduces memory footprint and disk size by approximately 50% compared to the unquantized counterpart. Intended for commercial and research use, it targets English-language assistant-like chat applications while maintaining strong quality, achieving an average OpenLLM benchmark score of 68.22.
Model Overview
This model, Meta-Llama-3-8B-Instruct-FP8, is a quantized version of Meta-Llama-3-8B-Instruct produced by Neural Magic. It retains the 8-billion-parameter Llama-3 architecture and applies FP8 quantization to both weights and activations, reducing the model's memory and disk footprint by approximately 50% and making it more efficient to deploy.
Key Optimizations and Performance
- FP8 Quantization: Weights and activations of linear operators within transformer blocks are quantized to FP8, enabling efficient inference with vLLM (version 0.5.0 or later).
- Performance: The model achieves an average score of 68.22 on the OpenLLM benchmark (version 1), closely matching the unquantized model's 68.71. This represents a recovery rate of 99.28% across various benchmarks like MMLU, ARC Challenge, and GSM-8K.
- Creation: Quantization was performed with AutoFP8, using 512 sequences from UltraChat for calibration (a sketch of this workflow follows this list).
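For reference, the hedged sketch below shows roughly how such an FP8 checkpoint can be produced with the AutoFP8 library starting from the original meta-llama/Meta-Llama-3-8B-Instruct weights. The calibration dataset ID, configuration arguments, and output path are illustrative assumptions and may not match the exact recipe used for this model.

```python
# Hedged sketch of FP8 quantization with AutoFP8; dataset choice, config
# arguments, and paths are illustrative assumptions, not the exact recipe.
from datasets import load_dataset
from transformers import AutoTokenizer
from auto_fp8 import AutoFP8ForCausalLM, BaseQuantizeConfig

pretrained_model_dir = "meta-llama/Meta-Llama-3-8B-Instruct"
quantized_model_dir = "Meta-Llama-3-8B-Instruct-FP8"

tokenizer = AutoTokenizer.from_pretrained(pretrained_model_dir, use_fast=True)
tokenizer.pad_token = tokenizer.eos_token

# 512 calibration sequences drawn from an UltraChat split (assumed dataset ID).
ds = load_dataset("HuggingFaceH4/ultrachat_200k", split="train_sft").select(range(512))
texts = [tokenizer.apply_chat_template(row["messages"], tokenize=False) for row in ds]
examples = tokenizer(
    texts, padding=True, truncation=True, max_length=2048, return_tensors="pt"
).to("cuda")

# Static FP8 scheme: weight and activation scales are fixed after calibration.
quantize_config = BaseQuantizeConfig(quant_method="fp8", activation_scheme="static")

model = AutoFP8ForCausalLM.from_pretrained(pretrained_model_dir, quantize_config)
model.quantize(examples)
model.save_quantized(quantized_model_dir)
```

The resulting directory can then be uploaded or pointed to directly by an inference engine that understands the FP8 checkpoint format, such as vLLM.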
Intended Use Cases
- Assistant-like Chat: Designed for commercial and research applications requiring English-language assistant capabilities.
- Efficient Deployment: Ideal for scenarios where reduced GPU memory use and faster inference are critical, especially when deployed with vLLM (see the sketch below).
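As an illustration of the vLLM deployment path, the minimal sketch below runs offline chat inference with this checkpoint; vLLM 0.5.0 or later is assumed for FP8 support, and the prompt and sampling parameters are arbitrary.

```python
# Minimal offline-inference sketch with vLLM; prompt and sampling settings are
# placeholders. vLLM reads the FP8 quantization config from the checkpoint.
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

model_id = "RedHatAI/Meta-Llama-3-8B-Instruct-FP8"

# Build a Llama-3 chat prompt using the model's own chat template.
tokenizer = AutoTokenizer.from_pretrained(model_id)
messages = [{"role": "user", "content": "Give me three tips for writing clear emails."}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

llm = LLM(model=model_id)
sampling = SamplingParams(temperature=0.6, top_p=0.9, max_tokens=256)

outputs = llm.generate([prompt], sampling)
print(outputs[0].outputs[0].text)
```

The same checkpoint can also be served through vLLM's OpenAI-compatible server for assistant-style chat applications.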