RedHatAI/Qwen2-1.5B-Instruct-FP8

Parameters: 1.5B | Tensor type: BF16 | Context length: 131,072 | License: apache-2.0

Model Overview

RedHatAI/Qwen2-1.5B-Instruct-FP8 is a 1.5-billion-parameter, instruction-tuned causal language model based on the Qwen2 architecture, produced by Neural Magic. It is an optimized variant of the original Qwen2-1.5B-Instruct in which both weights and activations are quantized to FP8.

Key Optimizations and Performance

The primary differentiator of this model is its FP8 quantization, which reduces disk size and GPU memory requirements by approximately 50% compared to the 16-bit original. Quantization was performed with AutoFP8 using UltraChat calibration samples. Despite the aggressive format, the model retains accuracy well, scoring an average of 54.59 on the OpenLLM benchmark (version 1) versus 55.18 for the unquantized model, a 98.93% recovery.
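The idea behind static per-tensor FP8 quantization can be sketched in a few lines. The following is an illustrative simulation only, not AutoFP8's implementation: the helper names are hypothetical, mantissa rounding is approximated with `math.frexp`, and E4M3 exponent limits and subnormals are ignored for brevity.

```python
import math

FP8_E4M3_MAX = 448.0  # largest finite value representable in E4M3

def round_e4m3(x):
    """Approximate E4M3 rounding: keep 3 stored mantissa bits plus the
    implicit leading bit; exponent range and subnormals are ignored."""
    if x == 0.0:
        return 0.0
    m, e = math.frexp(x)           # x = m * 2**e with 0.5 <= |m| < 1
    m = round(m * 16) / 16         # quantize mantissa to 4 significant bits
    return m * 2.0 ** e

def quantize_per_tensor(weights):
    """Static per-tensor scaling: map the tensor's max magnitude onto the
    FP8 range, then round each scaled weight to the E4M3 grid."""
    scale = max(abs(w) for w in weights) / FP8_E4M3_MAX
    q = [round_e4m3(w / scale) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate original values from FP8-grid values + scale."""
    return [v * scale for v in q]

weights = [0.5, -1.25, 3.0, -0.01]
q, scale = quantize_per_tensor(weights)
restored = dequantize(q, scale)
```

The per-tensor scale is what the UltraChat calibration pass estimates for activations; for weights it can be computed directly, as above, and each stored value then costs one byte instead of two.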

Intended Use Cases

  • Efficient Inference: Designed for efficient deployment and inference, particularly with the vLLM backend, making it suitable for resource-constrained environments.
  • Assistant-like Chat: Intended for commercial and research use in English for assistant-like conversational applications.
  • Reduced Resource Footprint: Ideal for scenarios where minimizing memory usage and maximizing throughput are critical.
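The roughly 50% footprint reduction follows directly from storing one byte per weight instead of two; a back-of-the-envelope check (assuming weights dominate the footprint and ignoring quantization scales, embeddings, and KV cache):

```python
PARAMS = 1.5e9  # approximate parameter count of Qwen2-1.5B-Instruct

def weight_gigabytes(params, bits_per_param):
    """Storage needed for the weights alone, in gigabytes (10**9 bytes)."""
    return params * bits_per_param / 8 / 1e9

bf16_gb = weight_gigabytes(PARAMS, 16)  # 16-bit baseline: ~3.0 GB
fp8_gb = weight_gigabytes(PARAMS, 8)    # FP8 weights: ~1.5 GB
savings = 1 - fp8_gb / bf16_gb          # ~0.5, i.e. the ~50% reduction
```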

Limitations

  • Primarily intended for English-language use cases.
  • Out-of-scope for any use violating applicable laws or regulations.