Entrit/Qwen2.5-72B-trit-uniform-d4
Entrit/Qwen2.5-72B-trit-uniform-d4 is a 72.7-billion-parameter quantized version of Qwen/Qwen2.5-72B, developed by Entrit Systems. It uses balanced ternary post-training quantization (PTQ) at depth d=4, for a reported 6.64 bits per weight. The published weights are stored dequantized, so the memory and compute savings apply to inference on hardware that can consume packed trit formats directly.
Entrit/Qwen2.5-72B-trit-uniform-d4: Balanced Ternary Quantization
This model is a 72.7-billion-parameter version of the Qwen2.5-72B base model, developed by Entrit Systems. It has undergone balanced ternary post-training quantization (PTQ) with a uniform method at depth d=4, giving 81 (3^4) levels per weight and a reported effective rate of 6.64 bits per weight. The quantization is based on the research presented in "Balanced Ternary Post-Training Quantization for Large Language Models" (Stentzel, 2026).
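To make the level count concrete: four balanced-ternary digits (trits), each in {-1, 0, +1}, represent every integer from -40 to +40, which is 81 levels. The sketch below converts between integers and trits. The helper names are hypothetical and not part of tritllm; note also that the levels alone carry log2(81) ≈ 6.34 bits, slightly below the reported 6.64 bits per weight.

```python
import math

DEPTH = 4                      # d = 4 balanced-ternary digits per weight
LEVELS = 3 ** DEPTH            # 81 representable levels
MAX_INT = (LEVELS - 1) // 2    # codes span the integers [-40, 40]

def to_balanced_ternary(n: int, depth: int = DEPTH) -> list[int]:
    """Decompose n into `depth` balanced trits (-1, 0, +1), least significant first."""
    if abs(n) > (3 ** depth - 1) // 2:
        raise ValueError(f"{n} is out of range for depth {depth}")
    trits = []
    for _ in range(depth):
        r = n % 3              # Python's % yields 0, 1 or 2 for a positive divisor
        if r == 2:             # digit 2 becomes -1, with a carry into the next position
            r = -1
        trits.append(r)
        n = (n - r) // 3       # exact division: n - r is a multiple of 3
    return trits

def from_balanced_ternary(trits: list[int]) -> int:
    """Recombine balanced trits (least significant first) into an integer."""
    return sum(t * 3 ** i for i, t in enumerate(trits))

print(LEVELS, round(math.log2(LEVELS), 2))   # 81 6.34
print(to_balanced_ternary(40))                # [1, 1, 1, 1]
print(to_balanced_ternary(-7))                # [-1, 1, -1, 0]
```

Balanced ternary is symmetric around zero, so negative weights need no separate sign bit and zero is exactly representable.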
Key Quantization Details
- Source Model: Qwen/Qwen2.5-72B
- Quantization Method: Uniform PTQ, applied to all 2D linear matrices.
- Depth: d=4 (81 levels), yielding 6.64 bits per weight.
- Group Size: 16
- FP16 Components: lm_head, token embeddings, and all *_norm layers remain in FP16 to preserve these critical model components.
- Codec: tritllm v2 is used for the quantization process.
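The settings above (uniform levels, d=4, group size 16) can be illustrated with a minimal group-wise quantizer. This is a sketch under assumptions, not the tritllm v2 codec: it assumes a symmetric per-group absmax scale and round-to-nearest onto the integer codes [-40, 40], and the function names are hypothetical.

```python
import random

GROUP_SIZE = 16                    # weights per group, as listed on the card
DEPTH = 4
MAX_INT = (3 ** DEPTH - 1) // 2    # 40, so codes lie in [-40, 40] (81 levels)

def quantize_group(weights: list[float]) -> tuple[float, list[int]]:
    """Assumed scheme: symmetric absmax scale, round-to-nearest onto 81 uniform levels."""
    absmax = max(abs(w) for w in weights)
    scale = absmax / MAX_INT if absmax > 0 else 1.0
    # Clamp guards against floating-point overshoot just past +/-40.
    codes = [max(-MAX_INT, min(MAX_INT, round(w / scale))) for w in weights]
    return scale, codes

def dequantize_group(scale: float, codes: list[int]) -> list[float]:
    """Map integer codes back to real values."""
    return [scale * c for c in codes]

def quantize_row(row: list[float], group_size: int = GROUP_SIZE) -> list[tuple[float, list[int]]]:
    """Split a weight row into contiguous groups and quantize each one independently."""
    if len(row) % group_size:
        raise ValueError("row length must be a multiple of group_size")
    return [quantize_group(row[i:i + group_size]) for i in range(0, len(row), group_size)]

# Usage: quantize a synthetic row of 64 weights and measure the reconstruction error.
random.seed(0)
row = [random.gauss(0.0, 0.02) for _ in range(64)]
groups = quantize_row(row)
recon = [w for scale, codes in groups for w in dequantize_group(scale, codes)]
max_err = max(abs(a - b) for a, b in zip(row, recon))
print(f"{len(groups)} groups, max abs error {max_err:.2e}")
```

With round-to-nearest, each weight's error is at most half its group's scale. Small groups such as 16 keep that scale tight around local outliers, at the cost of storing more scales. Writing the dequantized values back as FP16 is, in spirit, what lets a checkpoint like this one load in standard transformers.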
Performance and Use Cases
Because the weights are stored dequantized for compatibility with standard transformers, the on-disk size matches the FP16 source. The 6.64-bpw figure therefore matters for inference on specialized hardware that can process the packed trit format directly, where it aims to reduce the memory footprint and compute requirements of deploying large language models. The quantization process and codec can be reproduced with the provided tritllm-codec repository.
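A back-of-the-envelope estimate shows the gap between the FP16 checkpoint and a packed representation. The sketch assumes, for simplicity, that all 72.7B parameters are stored at 6.64 bits per weight. In practice lm_head, the embeddings, and the norm layers stay in FP16, so a real packed file would be somewhat larger than this figure.

```python
PARAMS = 72.7e9      # parameter count from the card
FP16_BITS = 16
TRIT_BPW = 6.64      # reported effective bits per weight

def footprint_gb(params: float, bits_per_weight: float) -> float:
    """Weight storage in gigabytes (10**9 bytes); ignores activations and KV cache."""
    return params * bits_per_weight / 8 / 1e9

fp16_gb = footprint_gb(PARAMS, FP16_BITS)    # FP16 weights, as stored on disk
packed_gb = footprint_gb(PARAMS, TRIT_BPW)   # simplified: every weight packed at 6.64 bpw
print(f"FP16: {fp16_gb:.1f} GB, packed trits: {packed_gb:.1f} GB, "
      f"ratio {fp16_gb / packed_gb:.2f}x")
# FP16: 145.4 GB, packed trits: 60.3 GB, ratio 2.41x
```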