PKU-DS-LAB/Fairy2i-W2

Text Generation · Concurrency Cost: 1 · Model Size: 7B · Quant: FP8 · Ctx Length: 4k · License: llama2 · Architecture: Transformer · Open Weights

Fairy2i-W2 by PKU-DS-LAB is a 7 billion parameter language model based on LLaMA-2, quantized to an effective 2 bits per weight through a novel complex-valued quantization framework. It transforms pre-trained real-valued layers into a widely-linear complex form, enabling extremely low-bit quantization while reusing existing checkpoints. This model is optimized for efficient inference on commodity hardware, achieving performance close to full-precision baselines.


Overview

PKU-DS-LAB's Fairy2i-W2 is a 7 billion parameter language model built upon the LLaMA-2 architecture, distinguished by its innovative approach to extreme low-bit quantization. It introduces Fairy2i, a universal framework that converts pre-trained real-valued layers into an equivalent widely-linear complex form, allowing for highly efficient 2-bit quantization without retraining from scratch.
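The paper's exact block structure is not spelled out here, but the transformation rests on a standard identity: any real linear map on paired real dimensions can be rewritten exactly as a widely-linear complex map y = Wz + V·conj(z). A minimal numpy sketch of that identity (toy sizes and the block pairing are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4  # half the real dimension; hypothetical toy size

# A real linear layer y = A @ x on R^{2n}, partitioned into n x n blocks.
A = rng.standard_normal((2 * n, 2 * n))
A11, A12 = A[:n, :n], A[:n, n:]
A21, A22 = A[n:, :n], A[n:, n:]

# Equivalent widely-linear complex map y = W z + V conj(z), with z = x1 + i*x2.
W = 0.5 * ((A11 + A22) + 1j * (A21 - A12))
V = 0.5 * ((A11 - A22) + 1j * (A21 + A12))

x = rng.standard_normal(2 * n)
x1, x2 = x[:n], x[n:]
z = x1 + 1j * x2

y_real = A @ x                      # original real layer
y_complex = W @ z + V @ np.conj(z)  # widely-linear complex form

# The two computations agree exactly (up to float rounding),
# which is why the transformation is lossless before quantization.
assert np.allclose(y_real[:n] + 1j * y_real[n:], y_complex)
```

Because the rewrite is exact, the complex-form model reproduces the pre-trained checkpoint's outputs before any quantization is applied.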

Key Capabilities & Innovations

  • Lossless Widely-Linear Transformation: Converts real-valued linear layers into complex form while preserving original model behavior before quantization.
  • Phase-Aware Complex Quantization: Utilizes a unique codebook of fourth roots of unity ({±1, ±i}) for quantizing complex weights, maintaining full-precision master weights during Quantization-Aware Training (QAT).
  • Recursive Residual Quantization: Employs a two-stage recursive mechanism to iteratively minimize quantization error, achieving an effective 2 bits per real parameter for Fairy2i-W2.
  • Performance: On LLaMA-2 7B, Fairy2i-W2 (2-bit) achieves a perplexity of 7.85 and an average zero-shot accuracy of 62.00%, closely matching FP16 performance (6.63 perplexity, 64.72% accuracy) and significantly outperforming other 2-bit real-valued quantization methods like AQLM and QuIP#.
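The quantization and residual steps above can be sketched in a few lines. Fairy2i's actual scaling granularity (per-channel vs. per-tensor) and codebook-assignment details are not given here, so the per-tensor scale and helper names below are illustrative assumptions; the mechanism shown is the one the bullets describe: snap each complex weight to a scaled fourth root of unity, then quantize the residual with a second stage.

```python
import numpy as np

CODEBOOK = np.array([1, -1, 1j, -1j])  # fourth roots of unity {±1, ±i}

def quantize_phase(w):
    """Snap each complex weight to its nearest fourth root of unity.

    For unit-modulus codewords, nearest-in-distance is the codeword
    maximizing Re(conj(c) * w)."""
    idx = np.argmax(np.real(np.conj(CODEBOOK)[:, None] * w.ravel()), axis=0)
    code = CODEBOOK[idx].reshape(w.shape)
    # Per-tensor least-squares scale minimizing ||w - s*code||^2 (assumed).
    scale = np.real(np.vdot(code, w)) / code.size
    return scale, code

def recursive_residual_quantize(w, stages=2):
    """Two-stage recursive residual quantization: w ≈ s1*c1 + s2*c2.

    Each stage stores 2 bits per complex weight, and one complex weight
    packs two real parameters, so two stages give an effective
    2 bits per real parameter (the W2 configuration)."""
    approx = np.zeros_like(w)
    for _ in range(stages):
        s, c = quantize_phase(w - approx)
        approx = approx + s * c
    return approx

rng = np.random.default_rng(0)
w = rng.standard_normal(256) + 1j * rng.standard_normal(256)
err1 = np.linalg.norm(w - recursive_residual_quantize(w, 1)) / np.linalg.norm(w)
err2 = np.linalg.norm(w - recursive_residual_quantize(w, 2)) / np.linalg.norm(w)
assert err2 < err1  # the second residual stage shrinks quantization error
```

In QAT, this quantizer would be applied in the forward pass while gradients update the full-precision master weights, as the second bullet notes.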

When to Use This Model

Fairy2i-W2 is ideal for scenarios requiring highly efficient inference of large language models on resource-constrained hardware. Its ability to reach near full-precision performance at an effective 2-bit precision makes it suitable for deploying LLaMA-2 7B where memory and compute budgets are tight. It bridges the gap between the efficiency of complex-valued arithmetic and the practical utility of existing pre-trained real-valued models.
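To make the memory argument concrete, a back-of-the-envelope estimate for weight storage alone (illustrative arithmetic; real checkpoints add quantization scales, embeddings, and other overhead):

```python
# Weight-memory estimate for a 7B-parameter model at different precisions.
params = 7e9

fp16_gb = params * 16 / 8 / 1e9   # 16 bits per weight
w2_gb = params * 2 / 8 / 1e9      # effective 2 bits per weight

print(f"FP16: {fp16_gb:.2f} GB, Fairy2i-W2: {w2_gb:.2f} GB "
      f"({fp16_gb / w2_gb:.0f}x smaller)")
# → FP16: 14.00 GB, Fairy2i-W2: 1.75 GB (8x smaller)
```

At roughly 1.75 GB of weights, the model fits comfortably in the memory of commodity consumer GPUs and many CPUs.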