Model Overview
RedHatAI/Qwen2-7B-Instruct-FP8 is a 7.6-billion-parameter instruction-tuned model based on the Qwen2 architecture, developed by Neural Magic. It is a quantized version of the original Qwen2-7B-Instruct, with weights and activations quantized to FP8.
Key Optimizations and Capabilities
- Reduced Resource Footprint: Quantizing weights and activations to FP8 roughly halves disk size and GPU memory requirements compared to the 16-bit original.
- Performance Preservation: Despite the quantization, the model scores an average of 69.44 on the OpenLLM benchmark (version 1), close to the unquantized model's 69.55, a 99.84% recovery rate.
- Efficient Deployment: Designed for efficient inference with vLLM >= 0.5.0, supporting streamlined deployment for chat applications.
- Quantization Method: Utilizes AutoFP8 with calibration samples from UltraChat for symmetric per-tensor quantization of linear operators within transformer blocks.
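The symmetric per-tensor scheme named above can be illustrated with a toy NumPy sketch. This is a simplified stand-in, not AutoFP8's actual implementation: a single scale maps the tensor's largest magnitude onto the FP8 E4M3 maximum of 448, and the mantissa rounding is only crudely approximated.

```python
import numpy as np

FP8_E4M3_MAX = 448.0  # largest finite value in the FP8 E4M3 format

def fake_fp8_round(x: np.ndarray) -> np.ndarray:
    """Crude stand-in for FP8 rounding: keep ~3 mantissa bits."""
    m, e = np.frexp(x)                 # x == m * 2**e, |m| in [0.5, 1)
    return np.ldexp(np.round(m * 16) / 16, e)

def quantize_per_tensor(w: np.ndarray):
    """Symmetric per-tensor quantization: one scale for the whole tensor."""
    scale = np.abs(w).max() / FP8_E4M3_MAX
    q = fake_fp8_round(np.clip(w / scale, -FP8_E4M3_MAX, FP8_E4M3_MAX))
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q * scale

w = np.array([[0.1, -2.0], [0.5, 1.5]], dtype=np.float32)
q, scale = quantize_per_tensor(w)
w_hat = dequantize(q, scale)
print(np.abs(w_hat - w).max())  # small reconstruction error
```

Each quantized value needs only 8 bits, which is where the roughly 50% savings over 16-bit precision comes from.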
Intended Use Cases
This model is primarily intended for:
- Assistant-like Chat: Suitable for commercial and research applications requiring conversational AI in English.
- Resource-Constrained Environments: Ideal for scenarios where minimizing GPU memory and disk usage is critical without significant performance degradation.
Evaluation Highlights
On the OpenLLM leaderboard (version 1), the FP8 model demonstrates strong performance across tasks, with notable recovery rates:
- MMLU (5-shot): 70.27 (99.22% recovery)
- ARC Challenge (25-shot): 62.03 (99.45% recovery)
- GSM-8K (5-shot): 69.83 (101.4% recovery, slightly outperforming the unquantized model)
- Average Score: 69.44 (99.84% recovery)
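The recovery figures above are simply the quantized score expressed as a percentage of the unquantized score; checking the average, where the card gives both numbers:

```python
def recovery(quantized: float, baseline: float) -> float:
    """Recovery rate: quantized score as a percentage of the unquantized score."""
    return 100.0 * quantized / baseline

# Average OpenLLM v1 scores from this card: 69.44 (FP8) vs 69.55 (BF16)
avg_recovery = recovery(69.44, 69.55)
print(f"{avg_recovery:.2f}%")  # → 99.84%
```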