Model Overview
RedHatAI/Qwen2-72B-Instruct-FP8 is a 72.7-billion-parameter instruction-tuned model in the Qwen2 family, developed by Neural Magic. It is a quantized version of Qwen2-72B-Instruct in which both weights and activations are stored in FP8. This cuts the model's disk size and GPU memory requirements by approximately 50%, making it more efficient to deploy.
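The ~50% figure follows directly from the storage cost per weight. A back-of-the-envelope sketch (weights only; actual footprints also depend on activations, KV cache, and runtime overhead, so these numbers are illustrative assumptions rather than measured sizes):

```python
# Rough weight-memory estimate for a 72.7B-parameter model.
# 16-bit formats (BF16/FP16) use 2 bytes per parameter; FP8 uses 1 byte.
params = 72.7e9

bf16_gb = params * 2 / 1e9  # ~145 GB in a 16-bit format
fp8_gb = params * 1 / 1e9   # ~73 GB in FP8 -- roughly half

print(f"BF16: ~{bf16_gb:.0f} GB, FP8: ~{fp8_gb:.0f} GB")
```

Halving the per-parameter byte count is what makes single-node deployment of a 72B model substantially more practical.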
Key Optimizations and Performance
This model uses symmetric per-tensor quantization: a single linear scale maps each tensor's values into the FP8 range. Quantization was performed with AutoFP8, using 512 sequences from UltraChat for calibration. Notably, the FP8 model achieves an average score of 80.34 on the OpenLLM benchmark (version 1), slightly above the unquantized Qwen2-72B-Instruct's 79.97. Quantization therefore reduces resource consumption while maintaining, and here even slightly improving, accuracy across tasks such as MMLU, ARC Challenge, and GSM-8K.
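A minimal sketch of what symmetric per-tensor scaling means, in plain Python. This is illustrative only: real FP8 kernels round values to hardware FP8 representations (e.g. E4M3), whereas this sketch only shows the single-scale mapping and clamping step; the constant and function names are hypothetical.

```python
# Symmetric per-tensor scaling: one scale maps the tensor's max magnitude
# onto the largest finite FP8 (E4M3) value, 448.0. "Symmetric" means the
# same scale is used for positive and negative values, with no zero-point.
FP8_E4M3_MAX = 448.0

def compute_scale(tensor):
    """Per-tensor scale: the value that maps max |x| onto FP8_E4M3_MAX."""
    amax = max(abs(x) for x in tensor)
    return amax / FP8_E4M3_MAX

def fake_quantize(tensor):
    """Scale into the FP8 range, clamp, then rescale (quantize-dequantize).

    Omits the round-to-nearest-FP8 step that hardware performs, so it
    shows only the linear-scaling part of the scheme.
    """
    scale = compute_scale(tensor)
    out = []
    for x in tensor:
        q = max(-FP8_E4M3_MAX, min(FP8_E4M3_MAX, x / scale))
        out.append(q * scale)
    return out, scale

weights = [0.5, -1.25, 3.0, 448.0]
dequantized, scale = fake_quantize(weights)
```

Because a single scale covers the whole tensor, per-tensor quantization stores only one extra float per tensor, at the cost of being sensitive to outlier values; calibration data (here, UltraChat sequences) is used to pick scales that fit the activations actually seen at inference time.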
Intended Use Cases
- Assistant-like Chat: Designed for conversational AI applications, in the same role as the original Qwen2-72B-Instruct.
- Commercial and Research Use: Suitable for both academic research and commercial deployments.
- Efficient Inference: Optimized for deployment with vLLM (version >= 0.5.0), enabling faster and more memory-efficient inference due to FP8 quantization.
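Assuming a recent vLLM installation, a model of this size is typically served through vLLM's OpenAI-compatible server; the tensor-parallel degree below is an illustrative assumption that depends on the GPUs available, not a requirement of the model:

```shell
# Launch an OpenAI-compatible server for the FP8 model (vLLM >= 0.5.0).
# --tensor-parallel-size shards the model across GPUs; adjust to your hardware.
vllm serve RedHatAI/Qwen2-72B-Instruct-FP8 --tensor-parallel-size 4
```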
Limitations
- Primarily intended for use in English.
- Not recommended for use in ways that violate applicable laws or regulations.