RedHatAI/Meta-Llama-3-8B-Instruct-FP8

Text generation · Model size: 8B · Quant: FP8 · Context length: 8k · Concurrency cost: 1 · Published: Apr 25, 2024 · License: llama3 · Architecture: Transformer

The RedHatAI/Meta-Llama-3-8B-Instruct-FP8 model, developed by Neural Magic, is an 8-billion-parameter instruction-tuned Llama-3 model with FP8 quantization applied to both weights and activations. This optimization reduces memory footprint and disk size by approximately 50% compared to the unquantized model. Intended for commercial and research use, it excels at English-language assistant-like chat while maintaining strong accuracy, scoring an average of 68.22 on the OpenLLM benchmark.


Model Overview

This model, Meta-Llama-3-8B-Instruct-FP8, is a quantized version of Meta-Llama-3-8B-Instruct, developed by Neural Magic. It uses the 8-billion-parameter Llama-3 architecture with FP8 quantization applied to both weights and activations, reducing the model's memory and disk footprint by approximately 50% and making it more efficient to deploy.

Key Optimizations and Performance

  • FP8 Quantization: Weights and activations of linear operators within transformer blocks are quantized to FP8, enabling efficient inference with vLLM (version 0.5.0 or later).
  • Performance: The model achieves an average score of 68.22 on version 1 of the OpenLLM benchmark, closely matching the unquantized model's 68.71. This corresponds to a recovery rate of 99.28% across benchmarks such as MMLU, ARC-Challenge, and GSM-8K.
  • Creation: Quantization was performed using AutoFP8 with 512 UltraChat sequences for calibration.
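To make the FP8 scheme concrete, here is a minimal, self-contained sketch of per-tensor FP8 quantization in plain Python. It is not the AutoFP8 implementation; it simply rounds scaled values to the nearest FP8 E4M3 representable value (ignoring subnormals, NaN, and per-channel scales) and illustrates why storage drops by roughly half versus FP16.

```python
import math

E4M3_MAX = 448.0  # largest normal magnitude representable in FP8 E4M3

def round_e4m3(x):
    """Round one float to the nearest FP8 E4M3 value (sketch: ignores subnormals/NaN)."""
    if x == 0.0:
        return 0.0
    m, e = math.frexp(abs(x))   # abs(x) = m * 2**e, with 0.5 <= m < 1
    m = round(m * 16) / 16      # keep 4 significant bits (1 implicit + 3 mantissa)
    return math.copysign(min(m * 2.0 ** e, E4M3_MAX), x)

def quantize_tensor(weights):
    """Per-tensor static quantization: one scale, then elementwise FP8 rounding."""
    scale = max(abs(w) for w in weights) / E4M3_MAX
    return [round_e4m3(w / scale) for w in weights], scale

def dequantize(q, scale):
    return [v * scale for v in q]

weights = [0.012, -0.4, 0.93, -0.0071, 0.25]
q, scale = quantize_tensor(weights)
recovered = dequantize(q, scale)
max_err = max(abs(a - b) for a, b in zip(weights, recovered))
print(f"scale={scale:.6f}  max abs error={max_err:.5f}")

# Storage: FP8 needs 1 byte per weight vs 2 bytes for FP16 -> ~50% smaller
fp16_gib = 8e9 * 2 / 2**30
fp8_gib = 8e9 * 1 / 2**30
print(f"8B params: FP16 ~ {fp16_gib:.1f} GiB, FP8 ~ {fp8_gib:.1f} GiB")
```

With only 3 mantissa bits, the worst-case relative rounding error is about 3%, which is why a calibration set (512 UltraChat sequences here) is used to pick activation scales that keep accuracy loss small.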

Intended Use Cases

  • Assistant-like Chat: Designed for commercial and research applications requiring English-language assistant capabilities.
  • Efficient Deployment: Ideal for scenarios where reduced GPU memory and faster inference are critical, especially when deployed with vLLM.
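A minimal deployment sketch with vLLM's OpenAI-compatible server follows. The flags and endpoint are standard vLLM usage, but treat the exact invocation as an assumption to verify against your vLLM version (the model card requires 0.5.0 or later for FP8 support); it also requires FP8-capable GPU hardware.

```shell
# Install a vLLM version with FP8 support (0.5.0 or later, per the model card)
pip install "vllm>=0.5.0"

# Launch an OpenAI-compatible server for the FP8 checkpoint
vllm serve RedHatAI/Meta-Llama-3-8B-Instruct-FP8 --max-model-len 8192

# Query the server (the model name must match the served checkpoint)
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "RedHatAI/Meta-Llama-3-8B-Instruct-FP8",
       "messages": [{"role": "user", "content": "Summarize FP8 quantization in one sentence."}]}'
```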

Popular Sampler Settings

Top 3 parameter combinations used by Featherless users for this model. Each configuration tunes the following parameters:

  • temperature
  • top_p
  • top_k
  • frequency_penalty
  • presence_penalty
  • repetition_penalty
  • min_p
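These parameters map directly onto the request body of an OpenAI-compatible endpoint (vLLM accepts `repetition_penalty` and `min_p` as extensions). The sketch below shows the shape of such a request; the values are illustrative placeholders, not the actual Featherless top-3 configurations, which are not listed in this text.

```python
import json

# Illustrative sampler settings -- placeholder values, NOT the actual
# Featherless "top 3" configs (those are not reproduced here).
sampler_config = {
    "temperature": 0.7,         # randomness of token sampling
    "top_p": 0.9,               # nucleus sampling cutoff
    "top_k": 40,                # restrict sampling to the k most likely tokens
    "frequency_penalty": 0.0,   # penalize tokens by how often they have appeared
    "presence_penalty": 0.0,    # penalize tokens that have appeared at all
    "repetition_penalty": 1.1,  # multiplicative repetition penalty (vLLM extension)
    "min_p": 0.05,              # drop tokens below min_p * top token probability
}

request_body = {
    "model": "RedHatAI/Meta-Llama-3-8B-Instruct-FP8",
    "messages": [{"role": "user", "content": "Hello!"}],
    **sampler_config,
}
print(json.dumps(request_body, indent=2))
```

Any subset of these keys can be sent; parameters omitted from the request fall back to the server's defaults.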