RedHatAI/Qwen2-0.5B-Instruct-FP8
Text generation · Concurrency cost: 1 · Model size: 0.5B · Quantization: FP8 · Context length: 32k · Published: Jun 14, 2024 · License: apache-2.0 · Architecture: Transformer

RedHatAI/Qwen2-0.5B-Instruct-FP8 is a 0.5 billion parameter Qwen2-based instruction-tuned language model developed by Neural Magic. It is an FP8-quantized version of Qwen2-0.5B-Instruct, optimized for efficient inference with vLLM, with a memory footprint reduced by approximately 50% relative to the 16-bit original. It is intended for assistant-like chat applications in English and retains 99.95% of the unquantized model's average score on the OpenLLM benchmark.


RedHatAI/Qwen2-0.5B-Instruct-FP8 Overview

This model is a 0.5 billion parameter Qwen2-based instruction-tuned language model, developed by Neural Magic. It is an FP8-quantized build of the original Qwen2-0.5B-Instruct, designed for efficient deployment and inference.

Key Optimizations and Capabilities

  • FP8 Quantization: The model's weights and activations have been quantized to FP8 data types. This significantly reduces the model's disk size and GPU memory requirements by approximately 50% compared to its 16-bit counterpart.
  • Performance Retention: Despite the aggressive quantization, the model maintains strong performance, achieving an average score of 42.94 on the OpenLLM benchmark (version 1), which is 99.95% of the unquantized model's score (42.96).
  • vLLM Compatibility: It is specifically prepared for efficient inference using the vLLM backend, supporting both direct deployment and OpenAI-compatible serving.
  • English Assistant-like Chat: The model is primarily intended for commercial and research use in English, excelling in assistant-like chat applications.
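The roughly 50% memory reduction claimed above follows directly from the storage cost per parameter. A minimal back-of-envelope sketch, assuming 1 byte per parameter for FP8 versus 2 bytes for BF16 (real checkpoints also carry embeddings, layers kept in higher precision, and metadata, so actual file sizes differ slightly):

```python
def weight_memory_gib(num_params: float, bytes_per_param: int) -> float:
    """Approximate weight storage in GiB for a given parameter count."""
    return num_params * bytes_per_param / 2**30

params = 0.5e9  # ~0.5 billion parameters

bf16 = weight_memory_gib(params, 2)  # 16-bit baseline: 2 bytes/param
fp8 = weight_memory_gib(params, 1)   # FP8: 1 byte/param

print(f"BF16: {bf16:.2f} GiB, FP8: {fp8:.2f} GiB, saving {1 - fp8 / bf16:.0%}")
# → BF16: 0.93 GiB, FP8: 0.47 GiB, saving 50%
```

The same halving applies to GPU memory occupied by the weights at load time, which is what makes the model attractive on small or shared accelerators.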

When to Use This Model

  • Resource-Constrained Environments: Ideal for scenarios where GPU memory and disk space are limited, but strong performance is still required.
  • Efficient Inference: When deploying with vLLM for high-throughput and low-latency inference.
  • English Chat Applications: Suitable for building chatbots and virtual assistants that operate in English.
  • Cost-Effective Deployment: The reduced memory footprint can lead to lower operational costs for inference.
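The vLLM deployment path mentioned above can be sketched as follows. This is a minimal example, assuming vLLM is installed on a CUDA-capable machine; the server listens on vLLM's default port 8000 and exposes an OpenAI-compatible API:

```shell
# Launch an OpenAI-compatible server for the model.
vllm serve RedHatAI/Qwen2-0.5B-Instruct-FP8

# In another shell, send a chat completion request.
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "RedHatAI/Qwen2-0.5B-Instruct-FP8",
        "messages": [{"role": "user", "content": "Who are you?"}],
        "max_tokens": 64
      }'
```

Because the endpoint speaks the OpenAI chat-completions protocol, any OpenAI-compatible client library can be pointed at it by overriding the base URL.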