RedHatAI/Meta-Llama-3-70B-Instruct-FP8

Text Generation · Concurrency Cost: 4 · Model Size: 70B · Quant: FP8 · Ctx Length: 8k · Published: May 24, 2024 · License: llama3 · Architecture: Transformer

RedHatAI/Meta-Llama-3-70B-Instruct-FP8 is a 70 billion parameter, FP8 quantized version of Meta-Llama-3-70B-Instruct, developed by Neural Magic. This model significantly reduces disk size and GPU memory requirements by quantizing weights and activations to FP8. It is optimized for assistant-like chat in English, achieving an average OpenLLM benchmark score of 79.16 with minimal accuracy loss compared to its unquantized counterpart.


Model Overview

RedHatAI/Meta-Llama-3-70B-Instruct-FP8 is a 70 billion parameter, instruction-tuned causal language model based on the Meta-Llama-3 architecture, developed by Neural Magic. This model is a quantized version of the original Meta-Llama-3-70B-Instruct, with both weights and activations quantized to FP8 data types. This optimization significantly reduces the model's disk size and GPU memory footprint by approximately 50% while maintaining strong performance.
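The roughly 50% figure follows directly from the change in bytes per parameter. As a back-of-the-envelope sketch (weights only, ignoring the KV cache, activations, and per-tensor quantization scales):

```python
# Rough estimate of weight memory for a ~70B-parameter model.
PARAMS = 70e9  # ~70 billion parameters

fp16_gib = PARAMS * 2 / 1024**3  # 2 bytes per parameter in FP16/BF16
fp8_gib = PARAMS * 1 / 1024**3   # 1 byte per parameter in FP8

print(f"FP16 weights: ~{fp16_gib:.0f} GiB")  # ~130 GiB
print(f"FP8 weights:  ~{fp8_gib:.0f} GiB")   # ~65 GiB
print(f"Reduction:    ~{100 * (1 - fp8_gib / fp16_gib):.0f}%")  # ~50%
```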

Key Capabilities & Optimizations

  • FP8 Quantization: Utilizes FP8 quantization for weights and activations, making it highly efficient for deployment with backends like vLLM.
  • Performance Retention: Achieves an average score of 79.16 on the OpenLLM benchmark, recovering 99.55% of the unquantized model's 79.51.
  • Reduced Resource Footprint: Ideal for environments where memory and storage are critical, enabling more efficient inference.

Intended Use Cases

  • Assistant-like Chat: Primarily designed for commercial and research applications involving assistant-like conversational AI in English.
  • Efficient Deployment: Suitable for developers who want to deploy a powerful 70B model with reduced hardware requirements, especially with vLLM (see the sketch after this list).
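A minimal deployment sketch with vLLM is shown below. The model ID comes from this card; the tensor-parallel size, context length, and sampling settings are illustrative assumptions and should be adjusted to your hardware:

```python
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

model_id = "RedHatAI/Meta-Llama-3-70B-Instruct-FP8"

# Format the conversation with the model's chat template.
tokenizer = AutoTokenizer.from_pretrained(model_id)
messages = [{"role": "user", "content": "Explain FP8 quantization in two sentences."}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

# Illustrative settings: even at FP8, a 70B model typically needs
# multiple GPUs (e.g. tensor_parallel_size=2 on 80 GB cards).
llm = LLM(
    model=model_id,
    tensor_parallel_size=2,
    max_model_len=8192,  # the card lists an 8k context length
)

sampling = SamplingParams(temperature=0.6, top_p=0.9, max_tokens=256)
outputs = llm.generate(prompt, sampling)
print(outputs[0].outputs[0].text)
```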

Evaluation Highlights

Evaluated on the OpenLLM Leaderboard, the FP8 quantized model shows strong performance across various benchmarks, including MMLU (80.06), ARC Challenge (72.61), and GSM-8K (91.12), with minimal degradation from the original FP16 model.
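A hedged sketch of how such scores could be reproduced with lm-evaluation-harness' Python API is below; the task list and few-shot count shown here are assumptions, and the OpenLLM Leaderboard uses specific per-task few-shot settings that should be matched before comparing numbers:

```python
# Reproduction sketch (assumed settings): evaluate GSM-8K 5-shot via vLLM.
import lm_eval

results = lm_eval.simple_evaluate(
    model="vllm",
    model_args="pretrained=RedHatAI/Meta-Llama-3-70B-Instruct-FP8,tensor_parallel_size=2",
    tasks=["gsm8k"],
    num_fewshot=5,
)
print(results["results"]["gsm8k"])
```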