Model Overview
namgyu-youn/gemma-3-27b-it-AWQ-INT4 is a 27-billion-parameter instruction-tuned Gemma model derived from google/gemma-3-27b-it. It has been quantized with the AWQ (Activation-aware Weight Quantization) INT4 method, using torchao v0.16.0 for the quantization process. The primary goal of this quantization is to reduce the model's memory footprint and improve inference speed; the original configuration indicates H100+ GPUs as the target deployment hardware.
Key Characteristics
- Quantization: AWQ INT4 (4-bit weight-only quantization) via torchao for efficiency.
- Base Model: Built on the google/gemma-3-27b-it instruction-tuned architecture.
- Parameter Count: 27 billion parameters.
- Context Length: Supports a context length of 32768 tokens.
- Deployment Focus: Designed for efficient inference, particularly on compatible hardware like H100+ GPUs.
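To make the "4-bit weight-only quantization" bullet concrete, the sketch below shows the basic idea of group-wise symmetric INT4 quantization in plain NumPy. This is a simplified illustration only: the real torchao/AWQ pipeline additionally applies activation-aware scaling and packs two 4-bit values per byte, and the group size of 128 here is an assumption for demonstration.

```python
import numpy as np

def quantize_int4_groupwise(w, group_size=128):
    """Symmetric 4-bit weight-only quantization with one scale per group.

    Simplified sketch of the idea behind INT4 weight-only schemes; the
    actual torchao/AWQ implementation also rescales weights based on
    activation statistics and stores weights in packed form.
    """
    w = w.reshape(-1, group_size)
    # int4 symmetric range is [-8, 7]; scale each group by its max magnitude
    scales = np.abs(w).max(axis=1, keepdims=True) / 7.0
    q = np.clip(np.round(w / scales), -8, 7).astype(np.int8)
    return q, scales

def dequantize(q, scales):
    # Recover an approximation of the original weights
    return q.astype(np.float32) * scales

rng = np.random.default_rng(0)
w = rng.standard_normal(1024).astype(np.float32)
q, s = quantize_int4_groupwise(w)
w_hat = dequantize(q, s).reshape(-1)
# Reconstruction error is bounded by half a quantization step per group
max_err = np.abs(w - w_hat).max()
```

Each group stores only 4 bits per weight plus one scale, which is the source of the memory savings relative to 16-bit weights.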
Usage and Limitations
The model is intended for tasks requiring the capabilities of gemma-3-27b-it but with reduced resource consumption. The README provides a reproduction script for generating this quantized checkpoint and includes benchmark attempts for accuracy (lm-eval on gsm8k) and throughput (vLLM). However, it notes that both lm-eval (v0.4.11) and vLLM (v0.15.1) failed to reproduce the expected results during benchmarking, indicating possible compatibility issues or specific environment requirements for evaluation. Users should be aware of these reported benchmarking challenges and may need to adapt their evaluation setups accordingly.
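For reference, a gsm8k accuracy check with lm-eval is typically invoked along the following lines. This is an illustrative command fragment, not the README's exact script: the `--model_args` needed to load a torchao-quantized checkpoint may differ, and as noted above the README reports that lm-eval v0.4.11 failed to reproduce results for this model.

```shell
# Illustrative lm-eval invocation; flags shown are the standard ones,
# but the exact model_args for this quantized checkpoint may differ.
lm_eval --model hf \
  --model_args pretrained=namgyu-youn/gemma-3-27b-it-AWQ-INT4 \
  --tasks gsm8k \
  --batch_size auto
```

Pinning the lm-eval and vLLM versions used in the README is advisable when trying to reproduce or debug the reported benchmark failures.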