RedHatAI/Meta-Llama-3-70B-Instruct-FP8

Text Generation · Concurrency Cost: 4 · Model Size: 70B · Quant: FP8 · Ctx Length: 8k · Published: May 24, 2024 · License: llama3 · Architecture: Transformer

RedHatAI/Meta-Llama-3-70B-Instruct-FP8 is a 70 billion parameter, FP8 quantized version of Meta-Llama-3-70B-Instruct, developed by Neural Magic. This model significantly reduces disk size and GPU memory requirements by quantizing weights and activations to FP8. It is optimized for assistant-like chat in English, achieving an average OpenLLM benchmark score of 79.16 with minimal accuracy loss compared to its unquantized counterpart.


Model Overview

RedHatAI/Meta-Llama-3-70B-Instruct-FP8 is a 70 billion parameter, instruction-tuned causal language model based on the Meta-Llama-3 architecture, developed by Neural Magic. This model is a quantized version of the original Meta-Llama-3-70B-Instruct, with both weights and activations quantized to FP8 data types. This optimization significantly reduces the model's disk size and GPU memory footprint by approximately 50% while maintaining strong performance.
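The roughly 50% figure follows directly from the change in bytes per parameter. As a back-of-the-envelope sketch (weights only, ignoring the KV cache, activations, and per-tensor quantization scales):

```python
# Rough estimate of weight memory for a ~70B-parameter model.
PARAMS = 70e9  # ~70 billion parameters

fp16_gib = PARAMS * 2 / 1024**3  # 2 bytes per parameter in FP16/BF16
fp8_gib = PARAMS * 1 / 1024**3   # 1 byte per parameter in FP8

print(f"FP16 weights: ~{fp16_gib:.0f} GiB")  # ~130 GiB
print(f"FP8 weights:  ~{fp8_gib:.0f} GiB")   # ~65 GiB
print(f"Reduction:    ~{100 * (1 - fp8_gib / fp16_gib):.0f}%")  # ~50%
```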

Key Capabilities & Optimizations

  • FP8 Quantization: Utilizes FP8 quantization for weights and activations, making it highly efficient for deployment with backends like vLLM.
  • Performance Retention: Achieves an average score of 79.16 on the OpenLLM benchmark, recovering 99.55% of the unquantized model's 79.51.
  • Reduced Resource Footprint: Ideal for environments where memory and storage are critical, enabling more efficient inference.

Intended Use Cases

  • Assistant-like Chat: Primarily designed for commercial and research applications involving assistant-like conversational AI in English.
  • Efficient Deployment: Suitable for developers who want to deploy a powerful 70B model with reduced hardware requirements, especially with vLLM (see the sketch after this list).
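A minimal deployment sketch with vLLM is shown below. The model ID comes from this card; the tensor-parallel size, context length, and sampling settings are illustrative assumptions and should be adjusted to your hardware:

```python
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

model_id = "RedHatAI/Meta-Llama-3-70B-Instruct-FP8"

# Format the conversation with the model's chat template.
tokenizer = AutoTokenizer.from_pretrained(model_id)
messages = [{"role": "user", "content": "Explain FP8 quantization in two sentences."}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

# Illustrative settings: even at FP8, a 70B model typically needs
# multiple GPUs (e.g. tensor_parallel_size=2 on 80 GB cards).
llm = LLM(
    model=model_id,
    tensor_parallel_size=2,
    max_model_len=8192,  # the card lists an 8k context length
)

sampling = SamplingParams(temperature=0.6, top_p=0.9, max_tokens=256)
outputs = llm.generate(prompt, sampling)
print(outputs[0].outputs[0].text)
```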

Evaluation Highlights

Evaluated on the OpenLLM Leaderboard, the FP8 quantized model shows strong performance across various benchmarks, including MMLU (80.06), ARC Challenge (72.61), and GSM-8K (91.12), with minimal degradation from the original FP16 model.
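A hedged sketch of how such scores could be reproduced with lm-evaluation-harness' Python API is below; the task list and few-shot count shown here are assumptions, and the OpenLLM Leaderboard uses specific per-task few-shot settings that should be matched before comparing numbers:

```python
# Reproduction sketch (assumed settings): evaluate GSM-8K 5-shot via vLLM.
import lm_eval

results = lm_eval.simple_evaluate(
    model="vllm",
    model_args="pretrained=RedHatAI/Meta-Llama-3-70B-Instruct-FP8,tensor_parallel_size=2",
    tasks=["gsm8k"],
    num_fewshot=5,
)
print(results["results"]["gsm8k"])
```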