RedHatAI/gemma-2-9b-it-FP8

Text generation · Model size: 9B · Quantization: FP8 · Context length: 16k · Published: Jul 8, 2024 · License: gemma · Architecture: Transformer

RedHatAI/gemma-2-9b-it-FP8 is a 9-billion-parameter Gemma 2 model developed by Neural Magic (Red Hat) and optimized with FP8 weight and activation quantization. It is a quantized version of google/gemma-2-9b-it, designed for efficient inference in assistant-like chat applications. It achieves an average OpenLLM benchmark score of 73.49, slightly outperforming its unquantized counterpart while roughly halving the memory footprint.


Model Overview

RedHatAI/gemma-2-9b-it-FP8 is a 9-billion-parameter Gemma 2 model developed by Neural Magic (Red Hat). This model is a quantized version of the google/gemma-2-9b-it instruction-tuned model, specifically optimized for efficient inference.

Key Optimizations

This model features FP8 weight and activation quantization, which significantly reduces the model's disk size and GPU memory requirements by approximately 50% compared to its 16-bit precision equivalent. The quantization process was performed using AutoFP8 with calibration samples from ultrachat, targeting linear operators within transformer blocks.
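To make the scheme concrete, here is a simplified, self-contained sketch of static per-tensor FP8 (E4M3) quantization in plain Python. It is not the AutoFP8 implementation: it ignores subnormals and the E4M3 exponent range, and only models the core idea, scaling a tensor so its largest magnitude maps to the E4M3 maximum (±448), then rounding each value to a 3-bit mantissa.

```python
import math

E4M3_MAX = 448.0  # largest finite value representable in FP8 E4M3

def round_to_e4m3(x):
    """Round x to the nearest value with 4 significant bits (1 implicit
    + 3 stored mantissa bits), clamped to the E4M3 range. Simplified:
    subnormals and the exponent range are ignored."""
    if x == 0.0:
        return 0.0
    m, e = math.frexp(x)          # x = m * 2**e, with 0.5 <= |m| < 1
    m = round(m * 16) / 16        # keep 4 significant bits
    return max(-E4M3_MAX, min(E4M3_MAX, m * 2 ** e))

def quantize_dequantize(weights):
    """Static per-tensor FP8 quantization: scale the tensor into the
    E4M3 range, round, then rescale back for use in matmuls."""
    scale = max(abs(w) for w in weights) / E4M3_MAX
    return [round_to_e4m3(w / scale) * scale for w in weights]
```

Because each weight is stored in one byte instead of two, this is where the roughly 50% memory reduction over a 16-bit checkpoint comes from; the rounding error per weight is bounded by the 3-bit mantissa (about 1/16 relative error at worst in this sketch).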

Performance

Despite the aggressive quantization, the gemma-2-9b-it-FP8 model maintains strong performance. It achieves an average score of 73.49 on the OpenLLM leaderboard (version 1), slightly surpassing the unquantized gemma-2-9b-it model's score of 73.23. Specific benchmark recoveries include:

  • MMLU (5-shot): 99.59% recovery (71.99 quantized vs 72.28 baseline)
  • ARC Challenge (25-shot): 100.0% recovery (71.50 vs 71.50)
  • GSM-8K (5-shot): 100.7% recovery (76.87 quantized vs 76.26 baseline)
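"Recovery" here is simply the quantized score expressed as a percentage of the 16-bit baseline score; a small helper (not part of the model card's tooling) makes the arithmetic explicit:

```python
def recovery(quantized_score: float, baseline_score: float) -> float:
    """Quantized benchmark score as a percentage of the 16-bit baseline;
    values above 100 mean the FP8 model scored higher."""
    return quantized_score / baseline_score * 100.0

# (quantized, baseline) pairs from the model card
benchmarks = {
    "MMLU (5-shot)": (71.99, 72.28),
    "ARC Challenge (25-shot)": (71.50, 71.50),
    "GSM-8K (5-shot)": (76.87, 76.26),
}
for name, (fp8, bf16) in benchmarks.items():
    print(f"{name}: {recovery(fp8, bf16):.2f}% recovery")
```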

Intended Use Cases

This model is primarily intended for commercial and research use in English for assistant-like chat applications. Its FP8 quantization makes it particularly suitable for scenarios where memory efficiency and faster inference are critical, such as edge deployments or environments with limited GPU resources. It is validated for use on RHOAI 2.20, RHAIIS 3.0, and RHELAI 1.5, and is designed for efficient deployment with backends like vLLM.
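Since the card names vLLM as a supported backend, a minimal serving sketch follows. The `--max-model-len` value mirrors the 16k context length above; flags beyond the model name are illustrative choices, not prescribed by the card.

```shell
# Start an OpenAI-compatible server for the FP8 checkpoint (a GPU with
# native FP8 support, e.g. Hopper or Ada, gives the full speed benefit)
vllm serve RedHatAI/gemma-2-9b-it-FP8 --max-model-len 16384

# Query it via the standard chat completions endpoint
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "RedHatAI/gemma-2-9b-it-FP8",
        "messages": [{"role": "user", "content": "Hello!"}]
      }'
```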

Limitations

  • Use in any manner that violates applicable laws or regulations is out of scope.
  • Not intended for use in languages other than English.