Overview
PKU-DS-LAB's Fairy2i-W2 is a 7-billion-parameter language model built on the LLaMA-2 architecture, distinguished by its approach to extreme low-bit quantization. It introduces Fairy2i, a universal framework that converts pre-trained real-valued linear layers into an equivalent widely-linear complex form, enabling highly efficient 2-bit quantization without retraining from scratch.
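To illustrate the idea of a widely-linear complex form, here is a minimal NumPy sketch (an assumption-laden illustration, not the Fairy2i implementation): any real linear map on R^(2n) can be rewritten as z ↦ Az + B·conj(z) on C^n by pairing coordinates as z = x1 + i·x2. The function names and the coordinate pairing below are hypothetical.

```python
import numpy as np

# Hypothetical sketch (not the released Fairy2i code): a real linear layer
# y = W @ x is rewritten as the widely-linear complex map z -> A z + B conj(z),
# where inputs/outputs are paired into complex vectors z = x1 + 1j * x2.

def to_widely_linear(W):
    """Split a real (2m x 2n) weight into the complex pair (A, B)."""
    m, n = W.shape[0] // 2, W.shape[1] // 2
    W11, W12 = W[:m, :n], W[:m, n:]
    W21, W22 = W[m:, :n], W[m:, n:]
    A = 0.5 * ((W11 + W22) + 1j * (W21 - W12))
    B = 0.5 * ((W11 - W22) + 1j * (W21 + W12))
    return A, B

def widely_linear_apply(A, B, x):
    """Apply A z + B conj(z), then map the complex output back to reals."""
    n = x.shape[0] // 2
    z = x[:n] + 1j * x[n:]
    y = A @ z + B @ np.conj(z)
    return np.concatenate([y.real, y.imag])

rng = np.random.default_rng(0)
W = rng.standard_normal((8, 6))
x = rng.standard_normal(6)
A, B = to_widely_linear(W)
# The rewrite is lossless: both paths agree up to float rounding.
assert np.allclose(W @ x, widely_linear_apply(A, B, x))
```

Because the transform is an exact algebraic identity, the model's behavior is preserved bit-for-bit (up to rounding) before any quantization is applied.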
Key Capabilities & Innovations
- Lossless Widely-Linear Transformation: Converts real-valued linear layers into complex form while preserving original model behavior before quantization.
- Phase-Aware Complex Quantization: Utilizes a unique codebook of fourth roots of unity ({±1, ±i}) for quantizing complex weights, maintaining full-precision master weights during Quantization-Aware Training (QAT).
- Recursive Residual Quantization: Employs a two-stage recursive mechanism that re-quantizes the residual error of the previous stage, progressively shrinking quantization error and yielding an effective 2 bits per real parameter for Fairy2i-W2.
- Performance: On LLaMA-2 7B, Fairy2i-W2 (2-bit) achieves a perplexity of 7.85 and an average zero-shot accuracy of 62.00%, closely matching FP16 performance (6.63 perplexity, 64.72% accuracy) and significantly outperforming other 2-bit real-valued quantization methods like AQLM and QuIP#.
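The phase-aware codebook and recursive residual steps above can be sketched together. The following is a hedged toy illustration under assumed details (per-tensor least-squares scales, nearest-phase codeword selection); the actual Fairy2i grouping and QAT machinery are not reproduced here.

```python
import numpy as np

# Hypothetical sketch (assumed details, not the released code): each complex
# weight is snapped to the nearest fourth root of unity {+1, -1, +i, -i}
# times a shared positive scale, and the residual is quantized again.
# Two stages store 2 codewords x 2 bits = 4 bits per complex weight,
# i.e. an effective 2 bits per real parameter.

def nearest_root_of_unity(w):
    """Phase-aware codeword: the fourth root of unity closest to w's phase."""
    real_dominant = np.abs(w.real) >= np.abs(w.imag)
    return np.where(real_dominant,
                    np.sign(w.real) + 0j,
                    1j * np.sign(w.imag))

def residual_quantize(w, stages=2):
    """Recursively quantize w; each stage fits its scale by least squares."""
    approx = np.zeros_like(w)
    for _ in range(stages):
        residual = w - approx
        c = nearest_root_of_unity(residual)
        # Optimal shared scale for unit-modulus codewords:
        s = np.mean((residual * np.conj(c)).real)
        approx = approx + s * c
    return approx

rng = np.random.default_rng(0)
w = rng.standard_normal(4096) + 1j * rng.standard_normal(4096)
err1 = np.linalg.norm(w - residual_quantize(w, stages=1))
err2 = np.linalg.norm(w - residual_quantize(w, stages=2))
assert err2 < err1  # the second stage strictly reduces quantization error
```

Because each codeword has unit modulus, the least-squares scale is just the mean of Re(w·conj(c)), and every extra residual stage can only decrease the reconstruction error; Fairy2i-W2 stops at two stages to hit the 2-bit budget.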
When to Use This Model
Fairy2i-W2 is ideal for scenarios requiring highly efficient inference of large language models on resource-constrained hardware. Its near-full-precision performance at an effective 2-bit precision makes it suitable for deploying LLaMA-2 7B where memory and compute budgets are tight. It bridges the gap between the efficiency of complex-valued arithmetic and the practical utility of existing pre-trained real-valued models.