Model Overview
RedHatAI/gemma-2-9b-it-FP8 is a 9-billion-parameter Gemma 2 model developed by Neural Magic (Red Hat). It is a quantized version of the instruction-tuned google/gemma-2-9b-it model, optimized for efficient inference.
Key Optimizations
This model applies FP8 quantization to both weights and activations, reducing disk size and GPU memory requirements by roughly 50% compared to the 16-bit original. Quantization was performed with AutoFP8, using calibration samples from the UltraChat dataset and targeting the linear operators inside transformer blocks.
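As a rough illustration of what static per-tensor FP8 (E4M3) scaling does, here is a minimal sketch in plain Python. This is not the AutoFP8 implementation; the helper names are invented here, and real FP8 also rounds mantissas to 3 bits, which the sketch omits:

```python
E4M3_MAX = 448.0  # largest finite value representable in FP8 E4M3

def fp8_scale(values):
    # A static per-tensor scale maps the observed max magnitude onto the
    # E4M3 range. For activations, this max is gathered from calibration
    # samples (UltraChat, in this model's case).
    return max(abs(v) for v in values) / E4M3_MAX

def quantize_dequantize(values, scale):
    # Scale down, clamp into the FP8-representable range, then scale back up.
    # Values inside the calibrated range survive; outliers get clamped.
    return [max(-E4M3_MAX, min(E4M3_MAX, v / scale)) * scale for v in values]
```

In-range values round-trip almost losslessly, while any value beyond the calibrated maximum is clamped to it, which is why calibration data matters.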
Performance
Despite quantization, the gemma-2-9b-it-FP8 model maintains strong accuracy. It achieves an average score of 73.49 on version 1 of the OpenLLM leaderboard, slightly above the unquantized gemma-2-9b-it's 73.23. Per-benchmark recoveries (quantized vs. baseline scores):
- MMLU (5-shot): 99.59% recovery (71.99 vs 72.28)
- ARC Challenge (25-shot): 100.0% recovery (71.50 vs 71.50)
- GSM-8K (5-shot): 100.7% recovery (76.87 vs 76.26)
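The recovery figures above are simply the quantized score expressed as a percentage of the baseline score; a quick check of the MMLU and ARC rows:

```python
def recovery_pct(quantized: float, baseline: float) -> float:
    # Recovery = quantized score as a percentage of the baseline score
    return 100.0 * quantized / baseline

mmlu = recovery_pct(71.99, 72.28)  # ~99.6
arc = recovery_pct(71.50, 71.50)   # exactly 100.0
```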
Intended Use Cases
This model is primarily intended for commercial and research use in English for assistant-like chat applications. Its FP8 quantization makes it particularly suitable for scenarios where memory efficiency and inference speed are critical, such as edge deployments or environments with limited GPU resources. It is validated for use on RHOAI 2.20, RHAIIS 3.0, and RHELAI 1.5, and is designed for efficient deployment with backends like vLLM.
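A minimal deployment sketch, assuming a working vLLM installation (a version that provides the `vllm serve` CLI) and FP8-capable hardware; the endpoint follows vLLM's OpenAI-compatible server layout:

```shell
# Serve the quantized model behind vLLM's OpenAI-compatible API (default port 8000)
vllm serve RedHatAI/gemma-2-9b-it-FP8

# From another shell, send a chat request
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "RedHatAI/gemma-2-9b-it-FP8",
       "messages": [{"role": "user", "content": "Hello!"}]}'
```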
Limitations
- Use in any manner that violates applicable laws or regulations is out of scope.
- Not intended for use in languages other than English.