Entrit/Qwen2.5-7B-qat-d2-6k
Entrit/Qwen2.5-7B-qat-d2-6k is a 7.6 billion parameter language model based on the Qwen2.5-7B architecture, developed by Entrit. This model features balanced ternary quantization at a depth of d=2, achieving 3.47 bits per weight through block-wise Quantization-Aware Training (QAT). It is specifically optimized for efficient inference on hardware that can directly consume packed trit formats, offering significant memory and computational savings.
Overview
Entrit/Qwen2.5-7B-qat-d2-6k is a 7.6 billion parameter model derived from the Qwen/Qwen2.5-7B base model. Its core innovation lies in its balanced ternary quantization, which reduces the model's weight representation to 3.47 bits per weight. This quantization was achieved using a block-wise Quantization-Aware Training (QAT) method, trained on 6,000 WikiText samples over 500 steps per block.
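As a rough illustration of what depth-2 balanced ternary quantization means, the sketch below maps each weight in a block to one of the 3^2 = 9 levels {-4, ..., 4} with a single per-block scale. The function names and the max-abs scaling rule are illustrative assumptions, not the card's actual QAT procedure, which trains through the quantizer rather than applying it post hoc.

```python
import numpy as np

def quantize_block_d2(block):
    # Map each weight to one of 3**2 = 9 balanced-ternary levels
    # {-4, ..., 4}, using a single per-block scale (an illustrative
    # choice; the actual QAT recipe may derive scales differently).
    levels = 4                                # (3**2 - 1) // 2
    amax = float(np.max(np.abs(block)))
    scale = amax / levels if amax > 0 else 1.0
    q = np.clip(np.round(block / scale), -levels, levels).astype(np.int8)
    return q, scale

def dequantize_block(q, scale):
    # Recover an FP approximation of the block from integer levels.
    return q.astype(np.float32) * scale

w = np.array([0.9, -0.31, 0.02, -1.2], dtype=np.float32)
q, s = quantize_block_d2(w)       # q holds integer levels in [-4, 4]
w_hat = dequantize_block(q, s)    # FP16/FP32 approximation of the block
```

During QAT, gradients would flow through a straight-through estimator around the rounding step; the snippet only shows the forward mapping.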
Key Quantization Details
- Source Model: Qwen/Qwen2.5-7B
- Quantization Depth: d=2 (9 levels per weight)
- Bits per Weight: 3.47
- Method: Block-wise QAT using the `tritllm-codec` from Entrit.
- Quantized Layers: All 2D linear matrices are ternary-quantized.
- FP16 Layers: `lm_head`, token embeddings, and all `*_norm` layers remain in FP16 to preserve accuracy.
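The 9 levels at d=2 can be decomposed into two balanced trits t1, t0 in {-1, 0, 1} with q = 3*t1 + t0, which is the per-weight representation a packed-trit consumer would operate on. A minimal sketch of that decomposition (the helper name is hypothetical):

```python
def to_trits(q, d=2):
    # Decompose a balanced-ternary level q (in [-4, 4] for d=2) into
    # d trits t_i in {-1, 0, 1} such that q == sum(t_i * 3**i).
    trits = []
    for _ in range(d):
        r = ((q + 1) % 3) - 1      # balanced remainder in {-1, 0, 1}
        trits.append(r)
        q = (q - r) // 3
    return trits                    # least-significant trit first

# Every d=2 level round-trips: q == t0 + 3 * t1
for q in range(-4, 5):
    t0, t1 = to_trits(q)
    assert q == t0 + 3 * t1
```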
Performance and Use Cases
While the model weights are dequantized to FP16 for compatibility with the standard transformers library, the model's full efficiency is realized on hardware that can process the packed trit format directly. This makes it particularly suitable for applications requiring a reduced memory footprint and faster inference, as detailed in the associated research paper, "Balanced Ternary Post-Training Quantization for Large Language Models" (Stentzel, 2026). The 3.47-bpw figure reflects the information content of the quantized matrices, consistent with how other quantization schemes report bits per weight.
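As a sketch of why a packed trit format saves memory, one common density trick stores 5 trits per byte, since 3^5 = 243 <= 256 (1.6 bits per trit, so 3.2 bits per d=2 weight before scale overhead). Whether the `tritllm-codec` packs this way is an assumption; the functions below are illustrative only.

```python
def pack5(trits):
    # Pack 5 balanced trits {-1, 0, 1} into one byte: 3**5 = 243 <= 256.
    assert len(trits) == 5
    val = 0
    for t in reversed(trits):
        val = val * 3 + (t + 1)    # map {-1, 0, 1} -> {0, 1, 2}
    return val

def unpack5(byte):
    # Inverse of pack5: recover 5 balanced trits from one byte.
    out = []
    for _ in range(5):
        out.append(byte % 3 - 1)
        byte //= 3
    return out

example = [1, 0, -1, 1, 1]
assert unpack5(pack5(example)) == example   # lossless round trip
```

A dedicated kernel would unpack trits on the fly and fuse the multiply by the block scale into the matmul, which is where the inference speedup on trit-aware hardware comes from.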