RedHatAI/Qwen2-0.5B-Instruct-FP8
Text Generation · Model Size: 0.5B · Quantization: FP8 · Context Length: 32k · Published: Jun 14, 2024 · License: apache-2.0 · Architecture: Transformer
RedHatAI/Qwen2-0.5B-Instruct-FP8 is a 0.5 billion parameter Qwen2-based instruction-tuned language model developed by Neural Magic. This model is an FP8 quantized version of Qwen2-0.5B-Instruct, optimized for efficient inference with vLLM by reducing memory footprint by approximately 50%. It is intended for assistant-like chat applications in English, maintaining 99.95% of the unquantized model's average performance on the OpenLLM benchmark.
RedHatAI/Qwen2-0.5B-Instruct-FP8 Overview
This model is a 0.5 billion parameter Qwen2-based instruction-tuned language model, developed by Neural Magic. It is a highly optimized version of the original Qwen2-0.5B-Instruct, specifically designed for efficient deployment and inference.
Key Optimizations and Capabilities
- FP8 Quantization: The model's weights and activations have been quantized to FP8 data types. This significantly reduces the model's disk size and GPU memory requirements by approximately 50% compared to its 16-bit counterpart.
- Performance Retention: Despite the aggressive quantization, the model maintains strong performance, achieving an average score of 42.94 on the OpenLLM benchmark (version 1), which is 99.95% of the unquantized model's score (42.96).
- vLLM Compatibility: It is specifically prepared for efficient inference using the vLLM backend, supporting both direct deployment and OpenAI-compatible serving.
- English Assistant-like Chat: The model is primarily intended for commercial and research use in English, excelling in assistant-like chat applications.
When to Use This Model
- Resource-Constrained Environments: Ideal for scenarios where GPU memory and disk space are limited, but strong performance is still required.
- Efficient Inference: When deploying with vLLM for high-throughput and low-latency inference.
- English Chat Applications: Suitable for building chatbots and virtual assistants that operate in English.
- Cost-Effective Deployment: The reduced memory footprint can lead to lower operational costs for inference.
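For the OpenAI-compatible serving path noted above, a deployment might look like the following sketch. The port and request payload are illustrative assumptions; the server needs a GPU to start.

```shell
# Launch vLLM's OpenAI-compatible server for this model (assumed port 8000).
vllm serve RedHatAI/Qwen2-0.5B-Instruct-FP8 --port 8000

# In another terminal, query the chat completions endpoint.
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "RedHatAI/Qwen2-0.5B-Instruct-FP8",
        "messages": [{"role": "user", "content": "Who are you?"}]
      }'
```

Existing OpenAI client code can then point at `http://localhost:8000/v1` without further changes, which is what makes this mode convenient for drop-in deployment.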