Model Overview
RedHatAI/gemma-2-9b-it-FP8 is a 9-billion-parameter Gemma 2 model developed by Neural Magic (Red Hat). It is a quantized version of the instruction-tuned google/gemma-2-9b-it model, optimized for efficient inference.
Key Optimizations
This model applies FP8 quantization to both weights and activations, reducing disk size and GPU memory requirements by roughly 50% compared to the 16-bit original. Quantization was performed with AutoFP8, using calibration samples from the UltraChat dataset and targeting the linear operators inside transformer blocks.
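As a rough illustration of what static per-tensor FP8 (E4M3) scaling does, here is a minimal sketch in plain Python. This is not the AutoFP8 implementation; the helper names are invented here, and real FP8 also rounds mantissas to 3 bits, which the sketch omits:

```python
E4M3_MAX = 448.0  # largest finite value representable in FP8 E4M3

def fp8_scale(values):
    # A static per-tensor scale maps the observed max magnitude onto the
    # E4M3 range. For activations, this max is gathered from calibration
    # samples (UltraChat, in this model's case).
    return max(abs(v) for v in values) / E4M3_MAX

def quantize_dequantize(values, scale):
    # Scale down, clamp into the FP8-representable range, then scale back up.
    # Values inside the calibrated range survive; outliers get clamped.
    return [max(-E4M3_MAX, min(E4M3_MAX, v / scale)) * scale for v in values]
```

In-range values round-trip almost losslessly, while any value beyond the calibrated maximum is clamped to it, which is why calibration data matters.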
Performance
Despite quantization, the gemma-2-9b-it-FP8 model maintains strong accuracy. It achieves an average score of 73.49 on version 1 of the OpenLLM leaderboard, slightly above the unquantized gemma-2-9b-it's 73.23. Per-benchmark recoveries (quantized vs. baseline scores):
- MMLU (5-shot): 99.59% recovery (71.99 vs 72.28)
- ARC Challenge (25-shot): 100.0% recovery (71.50 vs 71.50)
- GSM-8K (5-shot): 100.7% recovery (76.87 vs 76.26)
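The recovery figures above are simply the quantized score expressed as a percentage of the baseline score; a quick check of the MMLU and ARC rows:

```python
def recovery_pct(quantized: float, baseline: float) -> float:
    # Recovery = quantized score as a percentage of the baseline score
    return 100.0 * quantized / baseline

mmlu = recovery_pct(71.99, 72.28)  # ~99.6
arc = recovery_pct(71.50, 71.50)   # exactly 100.0
```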
Intended Use Cases
This model is primarily intended for commercial and research use in English for assistant-like chat applications. Its FP8 quantization makes it particularly suitable for scenarios where memory efficiency and inference speed are critical, such as edge deployments or environments with limited GPU resources. It is validated for use on RHOAI 2.20, RHAIIS 3.0, and RHELAI 1.5, and is designed for efficient deployment with backends like vLLM.
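A minimal deployment sketch, assuming a working vLLM installation (a version that provides the `vllm serve` CLI) and FP8-capable hardware; the endpoint follows vLLM's OpenAI-compatible server layout:

```shell
# Serve the quantized model behind vLLM's OpenAI-compatible API (default port 8000)
vllm serve RedHatAI/gemma-2-9b-it-FP8

# From another shell, send a chat request
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "RedHatAI/gemma-2-9b-it-FP8",
       "messages": [{"role": "user", "content": "Hello!"}]}'
```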
Limitations
- Use in any manner that violates applicable laws or regulations is out of scope.
- Not intended for use in languages other than English.