RedHatAI/Meta-Llama-3-8B-Instruct-FP8

Text generation · Model size: 8B · Quant: FP8 · Context length: 8k · Concurrency cost: 1 · Published: Apr 25, 2024 · License: llama3 · Architecture: Transformer

The RedHatAI/Meta-Llama-3-8B-Instruct-FP8 model, developed by Neural Magic, is an 8-billion-parameter instruction-tuned Llama-3 model with FP8 quantization applied to both weights and activations. This optimization reduces memory footprint and disk size by approximately 50% compared to the unquantized model. Intended for commercial and research use, it excels at English-language assistant-like chat while maintaining strong accuracy, scoring an average of 68.22 on the OpenLLM benchmark.


Model Overview

This model, Meta-Llama-3-8B-Instruct-FP8, is a quantized version of Meta-Llama-3-8B-Instruct, developed by Neural Magic. It uses the 8-billion-parameter Llama-3 architecture with FP8 quantization applied to both weights and activations, reducing the model's memory and disk footprint by approximately 50% and making it more efficient to deploy.

Key Optimizations and Performance

  • FP8 Quantization: Weights and activations of linear operators within transformer blocks are quantized to FP8, enabling efficient inference with vLLM (version 0.5.0 or later).
  • Performance: The model achieves an average score of 68.22 on version 1 of the OpenLLM benchmark, closely matching the unquantized model's 68.71. This corresponds to a recovery rate of 99.28% across benchmarks such as MMLU, ARC-Challenge, and GSM-8K.
  • Creation: Quantization was performed using AutoFP8 with 512 UltraChat sequences for calibration.
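To make the FP8 scheme concrete, here is a minimal, self-contained sketch of per-tensor FP8 quantization in plain Python. It is not the AutoFP8 implementation; it simply rounds scaled values to the nearest FP8 E4M3 representable value (ignoring subnormals, NaN, and per-channel scales) and illustrates why storage drops by roughly half versus FP16.

```python
import math

E4M3_MAX = 448.0  # largest normal magnitude representable in FP8 E4M3

def round_e4m3(x):
    """Round one float to the nearest FP8 E4M3 value (sketch: ignores subnormals/NaN)."""
    if x == 0.0:
        return 0.0
    m, e = math.frexp(abs(x))   # abs(x) = m * 2**e, with 0.5 <= m < 1
    m = round(m * 16) / 16      # keep 4 significant bits (1 implicit + 3 mantissa)
    return math.copysign(min(m * 2.0 ** e, E4M3_MAX), x)

def quantize_tensor(weights):
    """Per-tensor static quantization: one scale, then elementwise FP8 rounding."""
    scale = max(abs(w) for w in weights) / E4M3_MAX
    return [round_e4m3(w / scale) for w in weights], scale

def dequantize(q, scale):
    return [v * scale for v in q]

weights = [0.012, -0.4, 0.93, -0.0071, 0.25]
q, scale = quantize_tensor(weights)
recovered = dequantize(q, scale)
max_err = max(abs(a - b) for a, b in zip(weights, recovered))
print(f"scale={scale:.6f}  max abs error={max_err:.5f}")

# Storage: FP8 needs 1 byte per weight vs 2 bytes for FP16 -> ~50% smaller
fp16_gib = 8e9 * 2 / 2**30
fp8_gib = 8e9 * 1 / 2**30
print(f"8B params: FP16 ~ {fp16_gib:.1f} GiB, FP8 ~ {fp8_gib:.1f} GiB")
```

With only 3 mantissa bits, the worst-case relative rounding error is about 3%, which is why a calibration set (512 UltraChat sequences here) is used to pick activation scales that keep accuracy loss small.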

Intended Use Cases

  • Assistant-like Chat: Designed for commercial and research applications requiring English-language assistant capabilities.
  • Efficient Deployment: Ideal for scenarios where reduced GPU memory and faster inference are critical, especially when deployed with vLLM.
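A minimal deployment sketch with vLLM's OpenAI-compatible server follows. The flags and endpoint are standard vLLM usage, but treat the exact invocation as an assumption to verify against your vLLM version (the model card requires 0.5.0 or later for FP8 support); it also requires FP8-capable GPU hardware.

```shell
# Install a vLLM version with FP8 support (0.5.0 or later, per the model card)
pip install "vllm>=0.5.0"

# Launch an OpenAI-compatible server for the FP8 checkpoint
vllm serve RedHatAI/Meta-Llama-3-8B-Instruct-FP8 --max-model-len 8192

# Query the server (the model name must match the served checkpoint)
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "RedHatAI/Meta-Llama-3-8B-Instruct-FP8",
       "messages": [{"role": "user", "content": "Summarize FP8 quantization in one sentence."}]}'
```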

Popular Sampler Settings

Top 3 parameter combinations used by Featherless users for this model. Each configuration tunes the following parameters:

  • temperature
  • top_p
  • top_k
  • frequency_penalty
  • presence_penalty
  • repetition_penalty
  • min_p
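These parameters map directly onto the request body of an OpenAI-compatible endpoint (vLLM accepts `repetition_penalty` and `min_p` as extensions). The sketch below shows the shape of such a request; the values are illustrative placeholders, not the actual Featherless top-3 configurations, which are not listed in this text.

```python
import json

# Illustrative sampler settings -- placeholder values, NOT the actual
# Featherless "top 3" configs (those are not reproduced here).
sampler_config = {
    "temperature": 0.7,         # randomness of token sampling
    "top_p": 0.9,               # nucleus sampling cutoff
    "top_k": 40,                # restrict sampling to the k most likely tokens
    "frequency_penalty": 0.0,   # penalize tokens by how often they have appeared
    "presence_penalty": 0.0,    # penalize tokens that have appeared at all
    "repetition_penalty": 1.1,  # multiplicative repetition penalty (vLLM extension)
    "min_p": 0.05,              # drop tokens below min_p * top token probability
}

request_body = {
    "model": "RedHatAI/Meta-Llama-3-8B-Instruct-FP8",
    "messages": [{"role": "user", "content": "Hello!"}],
    **sampler_config,
}
print(json.dumps(request_body, indent=2))
```

Any subset of these keys can be sent; parameters omitted from the request fall back to the server's defaults.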